Build Better with AI: Why Design, Evaluation, and Real-World Testing Are Your Most Important Skills
AI can write code, generate content, run pipelines, and handle tasks that used to take days — in seconds.
So the question worth asking isn't "will AI do more work?" It already does.
The real question is: when AI is doing the execution, what does your job actually become?
The answer is three things — designing systems well, evaluating outputs rigorously, and testing in the real world. These aren't soft skills or abstract ideas. They are the most concrete, high-leverage things you can invest in right now. And they're entirely human.
Why these three skills? And why now?
AI outputs are probabilistic. They look correct. They sound confident. They pass a quick review. And then they fail in production in ways nobody predicted — on inputs nobody thought to test, in edge cases that only appear when real users show up.
This is the core challenge of building with AI. The model doesn't know what it doesn't know. It has no stake in your system working well. It will hallucinate, misinterpret, and generalise incorrectly — and it will do all of this fluently.
That gap — between AI output and an actual working system — is exactly where human skill lives. Design, evaluation, and real-world testing are how you close it.
Let's go deep on each one.
1. Iterative Design
Most people treat AI like a vending machine. Put in a prompt, get out a result, ship it.
That's not how systems work. And it's definitely not how good systems get built.
Iterative design means you don't try to get it right the first time. You build the smallest working version, put it in front of real conditions, watch what breaks, and refine. Over and over.
This matters more with AI than with traditional software for one reason: you cannot fully predict how an AI component will behave until it runs on real inputs. A RAG pipeline that returns perfect results on your curated test set will hallucinate on a live knowledge base with messy, inconsistent documents. An agent that handles your 10 test cases cleanly will hit an edge case on the 11th real user interaction.
The design process has to account for this. That means:
- Building a minimal working prototype first — not a complete solution
- Testing early with real or realistic inputs, not just clean examples
- Using each iteration to surface failure modes you couldn't anticipate upfront
- Letting the system tell you what it needs, rather than designing everything in advance
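The loop above can be sketched in a few lines. This is a minimal illustration, not a real pipeline: `run_prototype` and `looks_wrong` are hypothetical stand-ins for your smallest working version and a cheap failure check.

```python
def run_prototype(query: str) -> str:
    """Stand-in for the smallest working version of the system."""
    return query.strip().lower()  # placeholder behaviour


def looks_wrong(query: str, output: str) -> bool:
    """Cheap check that flags obviously broken outputs."""
    return not output  # here, an empty output counts as a failure


# Realistic inputs, not just clean examples: include the messy
# and informal phrasings real users actually send.
realistic_inputs = [
    "How do I reset my password?",
    "   ",           # messy input a real user might send
    "pwd reset??",   # informal phrasing
]

failures = []
for query in realistic_inputs:
    output = run_prototype(query)
    if looks_wrong(query, output):
        failures.append((query, output))

# The failure cases are the input to the next design iteration.
print(f"{len(failures)} failure(s) to study before iterating")
```

The point of the sketch is the shape of the loop: run on realistic inputs, collect what breaks, and let those failures drive the next build rather than a plan written upfront.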
The people who build the best AI systems aren't the ones with the best first prompt. They're the ones who iterate fastest and learn from each cycle.
Don't aim for perfect on the first build. Aim for running. Then aim for better.
2. Evaluation
This is the skill most people skip entirely — and the one that matters most at scale.
When AI does the execution, your job shifts to a different question: is this actually good?
Not "does it look good." Not "does it pass a quick read." But — does it consistently produce accurate, reliable, task-appropriate output across the full range of inputs it will encounter in the real world?
Answering that requires a real evaluation strategy. That means:
- Define what good looks like before you run the system — not after. If you don't have a standard before you see the output, the output will define your standard, which means you'll always think it's good enough.
- Functional testing — does the system do what it's supposed to do? Test across different input types, complexity levels, and phrasings. Don't only test the happy path.
- Boundary testing — what happens at the edges? Ambiguous inputs. Unusual queries. Very long or very short context. Missing data. This is where AI systems break most often — and where most engineers stop testing too early.
- User feedback loops — what do real users tell you about whether it works? Task completion rates, explicit ratings, and patterns in interaction logs all surface things that technical testing misses.
The key insight here: fluency is not correctness. AI systems produce output that reads well even when it's wrong. Evaluation is how you build systems that are actually right — not just ones that look right.
3. Real-World Testing
Controlled environments lie.
Your local test suite, your curated dataset, your handpicked evaluation set — they all test the system you imagined. Real-world testing tests the system you actually built, against the users and inputs you didn't imagine.
This is where most AI projects break down. The system works in development. It passes all the tests. It gets deployed. And then real users show up with unexpected phrasing, edge-case documents, concurrent load, and ambiguous requests — and the whole thing behaves differently than expected.
Real-world testing is how you find out before it matters, or fix it fast after it does.
Deploy in phases. Start with a limited rollout — a small user group, a single use case, a low-stakes environment. Let the system run on real traffic before you scale.
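A phased rollout can be as simple as deterministic user bucketing. This is a sketch, not a production feature-flag system; the percentage and the `in_rollout` helper are illustrative.

```python
import hashlib

ROLLOUT_PERCENT = 5  # start small, widen as confidence grows


def in_rollout(user_id: str, percent: int = ROLLOUT_PERCENT) -> bool:
    """Deterministically bucket a user into the rollout group.

    Hashing the id means the same user always gets the same answer,
    so their experience is stable across sessions.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent


# Route traffic: a small slice hits the new system, the rest
# stays on the existing path until the new one has earned trust.
if in_rollout("user-4821"):
    pass  # serve the new AI-backed system
else:
    pass  # serve the existing path
```

Deterministic hashing is the key design choice here: it gives you a stable, reproducible cohort without storing any assignment state.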
Monitor for signal, not just uptime. Response time and error rates tell you if the system is running. They don't tell you if it's working. Track accuracy, task completion, user satisfaction, and failure patterns.
Collect feedback actively. Build in mechanisms for users to signal when something went wrong — even something as simple as a thumbs down or a correction. These signals are the most valuable data you'll get.
Feed everything back into the loop. Real-world observations go back into design and evaluation. The loop doesn't close. It gets tighter.
The best AI engineers aren't the ones who build perfect systems from the start. They're the ones whose systems improve the fastest — because they've closed the feedback loop between real use and the next iteration.
The bigger picture
AI is taking on more execution work every month. That's not slowing down.
What doesn't change — what actually becomes more valuable as AI handles more execution — is the human capacity to design systems thoughtfully, evaluate outputs honestly, and test them against reality.
These three skills are not about controlling AI or being skeptical of it. They're about working with it at the level it actually requires. AI is powerful precisely because it generalises — and it fails for exactly the same reason. Design, evaluation, and real-world testing are how you take something powerful and make it reliable.
That's the work. And it's entirely yours.