Do we still need TDD?
Most of my career, I’ve worked on teams that valued TDD, either religiously or as a common mode of development. When I started using agentic coding tools seriously, I noticed something uncomfortable: these agents don’t practice TDD, they can’t be made to practice TDD, and yet the things TDD gave me still matter. I’ve been working through what that means, and I don’t have all the answers yet.
TDD is roughly defined by the following principles:

Incremental development. TDD’s red-green-refactor cycle (write a failing test, make it pass with minimal code, then clean up) keeps changes small and reversible. The goal is to never be more than a few minutes away from working code. This shrinks the debugging surface area—if a test fails, you know the problem is in the handful of lines you just wrote.
Design feedback, not just verification. Writing tests first forces you to think about your code’s interface and behavior before implementation. If something is hard to test, that difficulty is often a signal that the design has coupling problems, unclear responsibilities, or hidden dependencies. The test acts as the first client of your code.
Executable documentation. Tests describe what the code is supposed to do in concrete terms. Unlike comments or external docs, tests can’t drift out of sync with the implementation because they’ll fail. The goal is a living specification that stays accurate.
Confidence to refactor. A comprehensive test suite means you can restructure code aggressively without fear. The goal is to make the codebase malleable over time rather than increasingly rigid as it grows. (Granted, you can overtest and cause rigidity from the other direction, but that’s a separate discussion.)
Scope discipline. The TDD discipline of “write just enough code to make the test pass” serves as a forcing function against scope creep and speculative generalization. Your test suite defines the boundaries of your codebase’s scope.
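To make the red-green-refactor cycle concrete, here's a minimal sketch at the "green" step, using a hypothetical `slugify` helper (the name and behavior are illustrative, not from any particular codebase). The tests were written first and failed ("red"); the implementation is deliberately just enough to pass them, with cleanup deferred to the refactor step.

```python
def slugify(title: str) -> str:
    # "Green" step: the minimal code that satisfies the current tests.
    # No speculative handling of unicode, punctuation, or locales --
    # none of that is demanded by a test yet.
    return title.strip().lower().replace(" ", "-")

# These were written first, and failed before slugify existed ("red").
def test_slugify_joins_words_with_hyphens():
    assert slugify("Hello World") == "hello-world"

def test_slugify_ignores_surrounding_whitespace():
    assert slugify("  Hello World  ") == "hello-world"

test_slugify_joins_words_with_hyphens()
test_slugify_ignores_surrounding_whitespace()
```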
Whether you practice TDD or not, I think most engineers understand the value of all of the above.
Agentic coders are bad TDD practitioners
If you’re a TDD practitioner, the way that agentic coding tools like Claude Code, Codex, and Gemini write code is unsettling. They are most definitely not TDD practitioners by nature. And there’s good reason for that: they’ve been trained on codebases full of abstractions, design patterns, and fully mature architecture. By default, they’ll often produce interfaces with one implementation, configuration systems for things that have one value, abstraction layers “in case requirements change,” comprehensive error handling for conditions that can’t occur in your context, and plugin architectures when you need exactly one plugin.
None of this costs the agent anything to produce. It flows naturally from pattern-matching on existing code. And it doesn’t cost you anything in the moment—you didn’t have to write it. The cost is hidden and bites you later because there’s more code to understand, more surface area for bugs, and more inertia against change. You can very quickly end up with a bloated codebase that demands a heavy cognitive load to understand and manipulate. Worse, since you didn’t directly author it, you lack the intuitive understanding that comes from the sweat and tears of writing it by hand.

I learned this the hard way. When I first started using Cursor with Sonnet 3.5, I got excited by its abilities and decided to push its limits on a side project. I started vibe coding (though it hadn’t been labeled as such yet). It was incredible! But soon I crossed a threshold where I realized I had lost a grasp of the codebase. What I thought the codebase was doing and what it was actually doing had diverged several cycles back. Sonnet had begun hallucinating successful implementation of a piece of critical functionality. When I dug deeper, I discovered that the implementation was deeply flawed, and figuring out where to unwind to was genuinely challenging. I hadn’t been practicing TDD because this was just a little side project and I was having fun exploring the boundaries of a new paradigm. I hadn’t been making small incremental changes and commits. I was just letting Sonnet go and committing whenever I wanted to create a checkpoint.
Forcing traditional TDD into agentic coding is performative
These days, Opus 4.5 and Claude Code (and similar agents) are considerably better at writing correct code with fewer hallucinations. But none of them are naturally TDD practitioners, and the risks associated with not practicing TDD remain. Even if you explicitly demand an agent follow TDD in an AGENTS.md/CLAUDE.md file, it will often ignore that instruction. When it does make an attempt to follow it, all it really does is write tests. It doesn’t follow red-green-refactor. It doesn’t incrementally implement functionality. It writes the entire test file and implementation at once and runs the test suite after to see if it passes. That’s writing tests; it is not test-driven development.
We could go to a lot of effort to add guardrails (e.g. Claude Code hooks) in an attempt to enforce a workflow that’s more true to TDD, but I’d rather take a step back. TDD is a process designed for human engineers writing software. Its rituals are designed around the human experience of designing, writing, and evolving code based on specifications provided by the product owner. It intentionally adds friction at specific stages of the development process to force the human engineer to think about the design, scope, and behavior of the code they’re writing. Even if we add the ritual to the agentic coding process, the agent isn’t going to be affected by the process in the same way that the process affects the human coder. It would be performative, not meaningful.
Are TDD values still relevant and valuable?
If forcing traditional TDD into agentic coding is performative, does that mean TDD itself is a relic of the pre-LLM era? I don’t think so, but it requires separating the rituals from the values. The principles I outlined above are focused on the concrete actions—the “hows”—of TDD. What are the underlying values—the “whys”—and are those values still important when an agent is writing the code?
Rapid feedback loops are valuable. The shorter the gap between writing code and knowing whether it works, the easier problems are to fix and the less context you lose. TDD compresses this loop to seconds or minutes rather than hours or days. This principle extends beyond testing—it’s why compilation errors are easier to fix than runtime errors, and why continuous integration catches integration issues faster than end-of-sprint merges. Whether a human is writing the code or an agent is, a rapid feedback loop is valuable to the process.
Separating concerns in thinking leads to better decisions. TDD explicitly separates three mental modes: deciding what the code should do (writing the test), making it work (green), and making it clean (refactor). Trying to do all three simultaneously leads to muddled decisions. By forcing sequential phases, you can focus fully on each concern without juggling competing goals. Agentic coders fall into the same trap when the context window contains the instructions and history of executing on all three concerns at once—and they aren’t even following the red-green-refactor phase gating.
Designing for testability equals designing for usability. Code that’s easy to test in isolation tends to have clear inputs and outputs, minimal hidden state, explicit dependencies, and well-defined responsibilities. These same properties make code easier to understand, reuse, and modify. Testability becomes a proxy metric for general code quality. In my experience, this matters even more in an agentic coding environment where we’re dealing with limited context windows. When business logic is spread across multiple files with poor boundaries, hidden state, and implicit dependencies, the agent has a much harder time reasoning about the code and struggles to fit it all in the context window without muddling it with unrelated concerns. This inevitably results in poorer code generation.
Working software as a verifiable ground truth. Rather than reasoning abstractly about whether code is correct, TDD insists on demonstrable behavior. The test suite is a collection of existence proofs—“here is evidence that this specific behavior works.” This shifts arguments about correctness away from speculation and toward deterministic observation. Agentic coders will often speculatively declare that some code is correct when it isn’t. We still need deterministic evidence of correctness.
Sustainable pace through reduced rework. Bugs found later cost more to fix, both in time and in collateral changes. TDD front-loads the cost of quality rather than deferring it. The principle is that consistent small investments beat sporadic large ones when compounded over a project’s lifetime. When I was experimenting with Cursor and Sonnet 3.5, if I had been making small, test-verified changes and frequent commits, I would have realized much sooner that the code wasn’t doing what I thought it was—and it would have been far easier to identify the commit to revert to.
Humility about reasoning ability. TDD assumes we’re not good at holding complex systems in our heads or predicting all edge cases upfront. It substitutes confidence with automated verification, acknowledging that “I think this works” is weaker than “I have a passing test that demonstrates this works.” In my experience, coding agents are even worse than we are at holding complex systems in their “heads.” Their context windows are much smaller than our cognitive load capacity.
Scope discipline. TDD’s “write just enough code” and YAGNI (“You Ain’t Gonna Need It”) constraint resists the natural tendency to build for imagined future requirements. By limiting implementation to what the current test demands, you avoid accumulating code that serves no present purpose but carries ongoing maintenance cost. The test suite becomes a forcing function that keeps scope anchored to demonstrable needs rather than speculated ones. This value becomes arguably more critical with agentic coders, since, as I noted earlier, agents will freely produce abstractions, plugin architectures, and configuration systems that cost them nothing to generate but burden you with unnecessary complexity.
The “whys” are just as relevant when coding agents are writing code as they are when humans are—arguably more so.

Rethinking TDD principles within agentic constraints
So if forcing traditional TDD into agentic coding is performative but the values of TDD are still relevant, where does that leave us? TDD rituals were designed around the human experience and our cognitive constraints. When an agent writes code, the constraints shift. I’ve been working through each principle and how it might manifest differently in an agentic workflow.
Rapid feedback loops—but feedback on what?
In human TDD, the loop is “write code → run test → learn if code is correct.” With agents, the tighter loop becomes “specify intent → agent generates → learn if agent understood correctly.”
The problem shifts from implementation correctness to specification clarity. You might write a test that passes, but the agent satisfied it in a way that technically works while missing your actual intent. The feedback you need most is whether your specification was unambiguous enough.
What I’ve found effective is seeing agent output quickly and in small pieces. Asking an agent to build an entire feature in one shot breaks the feedback loop—you get a wall of code and no way to localize where misunderstandings crept in. Incremental generation with verification checkpoints preserves the principle even if the mechanism looks different.
Separation of concerns—different concerns now
Human TDD separates “what should it do” (test), “make it work” (green), and “make it clean” (refactor). With agents, the human role shifts almost entirely to the “what” while the agent handles implementation.
But a new concern emerges: validation that intent was preserved through the translation from natural language specifications to code. You’re now operating in a specify → generate → validate loop. These phases benefit from explicit separation. Trying to specify, review generated code, and assess design quality all at once leads to the same muddled thinking that TDD’s phases were designed to prevent.
I’d encourage treating these as genuinely distinct steps: first write your specification (tests, examples, or natural-language contracts), then let the agent generate without simultaneously reviewing, then validate as a separate step. Mixing them together invites confirmation bias—you see the code and unconsciously adjust your sense of what you wanted.
Testability as a proxy for quality—where’s the friction?
Here’s a real challenge. In human TDD, you experience the pain of testing tightly coupled or poorly designed code. That pain is the signal. Agents don’t feel pain. They’ll happily generate code with hidden dependencies, implicit state, or tangled responsibilities and won’t report any difficulty.
This means testability friction has to be reintroduced deliberately. I’ve been experimenting with a few approaches: using a separate review pass (human or agent) specifically focused on testability and design rather than just correctness; asking the agent to generate tests before implementation from the same specification, where its struggle to write clear tests surfaces design smells early; and asking the agent to explain how it would test the code it just wrote, where vague or complicated answers indicate problematic structure.
The underlying principle still holds: testable code is better code. But we need more explicit mechanisms to surface the quality signal that human TDD practitioners used to feel in their bones.
Working software as ground truth—more important, not less
Agents produce plausible-looking text. Code that reads correctly but doesn’t actually run correctly is a genuine failure mode of agentic coding systems. It’s precisely the failure mode that Reinforcement Learning with Verifiable Rewards (RLVR), a now-critical post-training technique for LLMs, was designed to combat.
This makes execution-based verification more essential than ever. And since LLMs have a tendency to satisfy specific examples while missing general behavior in ways humans wouldn’t, I’ve found it worth considering property-based tests and fuzzing as supplements to your testing strategy.
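To make the property-based idea concrete, here's a framework-free sketch (in practice you'd likely reach for a library such as Hypothesis). The `slugify` function is a hypothetical stand-in for agent-generated code; the point is that the assertions express invariants that must hold across many random inputs, not just the hand-picked examples an agent might overfit to.

```python
import random
import string

def slugify(title: str) -> str:
    # Hypothetical agent-generated function under test.
    return title.strip().lower().replace(" ", "-")

def check_slugify_properties(trials: int = 500) -> None:
    rng = random.Random(0)  # seeded so any failure is reproducible
    for _ in range(trials):
        length = rng.randint(0, 20)
        title = "".join(rng.choice(string.ascii_letters + " ")
                        for _ in range(length))
        slug = slugify(title)
        # Invariants that must hold for *every* input:
        assert slug == slug.lower(), title   # output is always lowercased
        assert " " not in slug, title        # no spaces survive
        assert slugify(slug) == slug, title  # applying it twice changes nothing

check_slugify_properties()
```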
Humility—now about two unreliable systems
TDD’s humility principle is about not trusting human reasoning. Now you have two reasoning systems to distrust: yours (for specification) and the agent’s (for implementation).
This is where I’ve found value in adversarial or independent verification. One approach: write the specification, have the agent implement, then write tests independently yourself—not just reviewing agent-generated tests. Your tests probe what you meant; discrepancies reveal either agent misunderstanding or ambiguity in your spec. Another: have one agent implement and a separate agent (or separate context) review or test. Independence matters here because an agent asked “does this code match this spec” is doing a fundamentally different thing than an agent asked “here’s a spec, write tests for it” followed by running those tests against the implementation.
Scope discipline—the constraint the agent lacks entirely
In human TDD, “write just enough code to make the test pass” is self-enforcing: you feel the effort of writing unnecessary code, so you don’t write it. Agents have no such constraint. They’ll generate abstractions, configuration layers, and “future-proofing” infrastructure as readily as the minimal solution because it costs them nothing. The agent won’t naturally resist scope creep; you have to impose it externally.
Human involvement is necessary here because someone has to take responsibility for the code as committed, and that requires real judgment. I would not fully delegate this responsibility, especially for meatier code contributions.
In my experience, being explicit about scope in your specification helps considerably. Rather than “implement user authentication,” try “implement password-based login for a single user type with no OAuth, no social login, no multi-factor—just email and password.” The agent will still try to over-engineer; explicit constraints give you leverage to push back.
From there, I’ve found that using a combination of tests and code coverage tooling to make your test suite the scope boundary is effective. If a piece of generated code isn’t exercised by any test, question whether its existence is justified or speculative. Pruning becomes an explicit part of your review workflow.
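As a sketch of what that might look like mechanically, the snippet below scans a coverage report and flags low-coverage files as candidates for the "justified or speculative" conversation. It assumes the JSON shape produced by coverage.py's `coverage json` command (a top-level "files" mapping with per-file "summary" percentages); adapt it to whatever coverage tooling you actually use.

```python
import json

def flag_unexercised(report: dict, threshold: float = 50.0) -> list[str]:
    """Return paths whose coverage falls below `threshold` percent.

    These are candidates for review: is this code justified by a
    current requirement, or is it speculative?
    """
    flagged = [
        path
        for path, data in report.get("files", {}).items()
        if data["summary"]["percent_covered"] < threshold
    ]
    return sorted(flagged)

# Typical usage after `coverage run -m pytest && coverage json`:
# with open("coverage.json") as f:
#     for path in flag_unexercised(json.load(f)):
#         print(f"{path}: weakly exercised -- justified or speculative?")
```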
Even if the tests exercise the code, that doesn’t mean we’ve avoided scope creep in implementation complexity. I’d encourage reviewing specifically for YAGNI violations: what code here serves no current requirement? Interfaces with one implementation, configuration for single values, and abstraction layers for hypothetical extensions are candidates for removal.
The underlying value of avoiding accidental complexity from building for imagined futures remains critical, but where TDD’s mechanism was additive friction for the human writing the code, the agentic equivalent may be subtractive review.
What does this look like in practice?
I’m still thinking this through and experimenting in my own Claude Code setup, but I think it roughly looks like this.
First, the specification, generation, and validation phases need to be distinct, with explicit transitions between them. Not because ritual matters, but because mixing phases leads to the muddled reasoning I discussed earlier.
Tests, examples, or contracts serve as the canonical input—specification as the source of truth. The agent’s implementation is measured against this, not against vague notions of what you wanted. This also creates an audit trail: here’s what was specified, here’s what was generated, here’s how they compare. Something approximating this has already begun to emerge as a proposed methodology through spec-driven development (in this context, the specs are natural language, not code), though the solutions are still fairly immature and rapidly evolving.
After generation, an automated validation pipeline (e.g. triggered by hooks) that runs the test suite, code coverage, static analysis, and possibly a separate review agent gives us deterministic and near-instant feedback. This can be fed right back into the agentic loop, minimizing the manual work left for the human review step.
Building in minimality checks has been valuable in my experiments. Code coverage tools identify unexercised code. Flagging any new code that isn’t exercised by specifications and sending it back to the agent with “potentially unnecessary—justify or delete” gives us an automated mechanism for pushing toward minimality. The agent will still overengineer sometimes, but this reduces the manual review burden.
For independent verification, the framework could invoke a second agent (or the same agent in a fresh context) to generate a second set of tests from the natural language specification. If the implementation passes the original tests but fails the separately generated ones, you have a discrepancy that reveals either a flawed implementation or a misunderstanding about the specification.
And finally, while we might not be writing most of our code anymore, we still must take responsibility for the code we commit. Once the automated pipeline believes the implementation is complete, correct, and well-scoped, we still need to sit down and review our code. That human review step isn’t going away.