Why validation is your coding agent’s missing layer
Observations from Claude Code
Generation without validation is useless. Although LLMs may be capable of checking that they generated the right output, they often don’t know whether they should.
Validation isn’t just about testing, but about whether an output is aligned with your goals, which is much harder to gauge. The biggest issue lately isn’t missing capability but missing validation and structured workflows. If there’s a coding workflow that has been done countless times, it’s a blog migration. For any LLM trained on web content, there must be thousands of posts in its training set with a title like “How I migrated my blog from X static site generator to Y”. Migrating from one static site generator to another is nothing novel. If an LLM struggles at this task, the problem likely isn’t capability but tooling.
For any task to succeed, an agent needs knowledge, access, and automated validation.
Knowledge: comes from a model’s training. It can be strengthened through additional context provided via documents.
Access: comes from an agent’s harness and enables the model to interact with the broader world. We’ve seen countless tools in this area now: Codex, Claude Code, etc.
Automated validation: comes from an iterative process that checks whether a model’s outputs are correct. There are many frameworks for optimizing prompts, but validation is the layer that’s least defined in active usage right now.
The power of validation, whether automated or human, is that it gives iterative feedback. This gets you beyond what an LLM can generate in one shot. Although an agent may validate part of its work, the space of what could be validated is extremely large. Developers can narrow it by building automated validation flows or by using their intuition about what’s meaningful.
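An automated validation flow can be as small as a loop that checks each output and feeds the failures back in. Here is a minimal sketch of that idea; the `generate` and `validate` functions are hypothetical stand-ins, not part of any real agent API.

```python
# A minimal sketch of an automated validation loop (all names hypothetical).
# A validator checks the model's output; any problems found are fed back
# as context so the next attempt can improve on the last.

def validate(output: str) -> list[str]:
    """Return a list of problems found; empty means the output passes."""
    problems = []
    if "<a " in output and "href=" not in output:
        problems.append("anchor tag without href")
    if not output.strip():
        problems.append("empty output")
    return problems

def generate(prompt: str) -> str:
    """Stand-in for a model call; here it just returns a fixed page."""
    return "<html><body><a href='/about'>About</a></body></html>"

def generate_with_validation(prompt: str, max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        output = generate(prompt)
        problems = validate(output)
        if not problems:
            return output
        # Feed the validator's findings back into the next prompt.
        prompt = f"{prompt}\nFix these problems: {problems}"
    raise RuntimeError(f"validation failed after {max_attempts} attempts")
```

The point isn’t the toy checks; it’s that once a check exists in code, the feedback becomes iterative instead of one-shot.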
Both manual validation and automated validation deserve more investment. After all, when have you provided anyone with a perfect specification to get exactly what you want in one go?
Validation criteria matter more than your specification
The website was a Jekyll blog, and I gave an extremely vague prompt to convert the site to React. Even without a clear specification, Claude Code had the knowledge to easily complete the core parts of the task. A Next.js site was created with some of the content from the original blog, as I requested.
People are working on spec-driven development tools and this is a necessary approach that deserves attention. I’ve had to stop an agent countless times and push it in a new direction when my prompt wasn’t clear enough.
However, there’s a deeper issue at play with a simple solution: checking what correctness looks like. A spec for a site design may not have included the idea that links must be clickable. Yet an agent can easily check whether links are clickable by examining the HTML source, or by being given a browser harness and seeing what happens when it clicks on them.
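The HTML-source version of that check fits in a few lines of standard-library Python. This is only a sketch of the idea; the snippet of HTML is a made-up example, and a real check would also follow each `href` to confirm the target exists.

```python
# One cheap, automatable correctness check: every link in the generated
# HTML must actually have a clickable target. Standard library only.
from html.parser import HTMLParser

class LinkChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.broken = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # No href, or a placeholder "#", means the link goes nowhere.
            if not href or href == "#":
                self.broken.append(attrs)

html = """
<nav>
  <a href="/posts">Posts</a>
  <a href="#">About</a>
</nav>
"""

checker = LinkChecker()
checker.feed(html)
print(f"{len(checker.broken)} broken link(s)")  # -> 1 broken link(s)
```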
Some basic experimentation with LLMs will quickly show you that they don’t engage in testing like this by default, even when the context makes it obvious, such as when migrating to a new blogging platform.
When I had Claude migrate my blog, the posts rendered incorrectly, and there were other critical issues: a post that was supposed to stay hidden had been published (it had a “published: false” key in its frontmatter). Even a vague prompt like “fix the broken rendered posts” was enough to change that, though. A full specification of what was broken wasn’t necessary, but the attention to fix it was.
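The hidden-post bug is exactly the kind of thing a small migration check can catch mechanically. Here is a sketch of one, assuming simple YAML-style frontmatter; the post contents are illustrative, and a real check would walk the output directory and flag any draft that made it through.

```python
# Sketch of a migration check for the hidden-post failure: flag any file
# whose frontmatter contains "published: false" but that still ended up
# in the published output.
import re

def is_draft(post_text: str) -> bool:
    """Parse a simple frontmatter block and look for `published: false`."""
    match = re.match(r"---\n(.*?)\n---", post_text, re.DOTALL)
    if not match:
        return False
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "published" and value.strip().lower() == "false":
            return True
    return False

hidden_post = "---\ntitle: Secret\npublished: false\n---\nDraft body."
public_post = "---\ntitle: Hello\n---\nHello, world."

print(is_draft(hidden_post))  # -> True
print(is_draft(public_post))  # -> False
```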
Agents need to lean into their strengths and you fill in the rest
Through my usage of LLMs, it’s clear they’ve gotten great at using tools, but this means they can easily neglect their own native capabilities. Structured tasks, like find-and-replace, are clearly better done with code or a CLI. These tasks are consistent, defined once, and can be applied endlessly. Tools are perfect for tasks that can be easily validated automatically: a tool either does what it says it does or it doesn’t.
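A concrete example of such a structured task: rewriting a path pattern across every post during a migration. The paths and pattern below are hypothetical, but the shape is the point: defined once, applied identically everywhere, and trivially validated by counting what changed.

```python
# A find-and-replace like this belongs in code, not in the model's head:
# it is deterministic, and the returned count makes it self-validating.
import re
from pathlib import Path

def migrate_image_paths(root: Path) -> int:
    """Rewrite old-style image paths in every Markdown post under root.

    Returns the number of replacements made, so the caller can check
    the result against expectations.
    """
    replaced = 0
    for post in root.glob("*.md"):
        text = post.read_text()
        new_text, count = re.subn(r"/assets/img/", "/images/", text)
        if count:
            post.write_text(new_text)
            replaced += count
    return replaced
```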
If validation criteria are ill-defined, using an LLM’s native capabilities rather than any tool can make a lot more sense.
This insight became obvious when I saw that Claude attempted to write a Python script to migrate my posts, which are a mix of HTML, Markdown, and scripts.
In this case, automated validation is difficult. It’s nearly impossible to define what it means for a post to be perfectly rendered when it can contain almost any content, and reading through a custom parser script is very time-consuming. But human validation is nearly effortless, because I can immediately tell when a post looks wrong. Things work best with LLMs when developers and agents can easily validate outputs in a way they find natural. Taste is the ability to instinctively tell that something looks “wrong”.
Frictionless generation unlocks new abilities to the limit of your ability to validate
The magic of AI is that it gives you abilities you didn’t have before, and friction isn’t a reason to deny yourself that. Lately there’s been a lot of discussion about taste, which in my view is the ability to recognize when outputs look good. Taste is the fundamentally human aspect of validation and the necessary counterpart to what cannot be validated automatically.
I may understand React well enough, but I have little experience building beautiful UIs. Claude can do that for me to the limit of what I understand to be beautiful. It has enough access to my computer to build anything and enough knowledge to know what could look good in its training. The limit isn’t my inability to build beautiful UIs but to recognize them and to provide any form of validation criteria on UI design.
My first experiment with Claude was building a dashboard completely from scratch for my AWS side projects. This is something I never would’ve had time for before, but I knew exactly what I wanted from it and I knew what an incorrect dashboard would look like. Knowing what a bad dashboard looks like is taste.
If generation is frictionless, automating the validation layer lets you constrain the outputs. If you can constrain the outputs automatically, you can automate the task. This means the easiest code to write will become the code that is easiest to automatically validate.