Microagent Architecture
What agent builders should take from microservices, and what they should refuse.
The kitchen-sink agent
There is a failure mode that most folks building with LLM agents eventually meet. The agent started clean: a focused prompt, four tools, one job, reviewing pull requests against the team’s conventions. Then it accreted. Someone added a Jira tool so reviews could link tickets, then a Linear tool when half the org migrated, then a paragraph explaining when to use which. Edge cases got patched with more instructions. A retrieval step began dumping documents into the context just in case. A year later the system prompt is six thousand tokens, the tool list is pushing forty, and the agent has developed moods. It searches the wiki when it should read the code. It obeys an instruction in paragraph three that paragraph nineteen was supposed to override. It does fine for the first twenty minutes of a review and then, somewhere past the hundredth tool call, quietly loses the plot. Fixing one behavior regresses two others, and nobody can say why, because the only test that exists is running the whole thing end to end and squinting at the output. (This particular agent is a composite. If it doesn’t resemble something running in your org today, give it a quarter.)
If you spent the 2010s anywhere near backend engineering, you have seen this shape before. It is a monolith. And the industry’s response is starting to rhyme with the last one: decompose it. Call the result microagent architecture, the agent-era echo of microservices, with most of the same promises, several of the same traps, and some physics that are genuinely new. Keep the review agent in mind. We are going to take it apart.
The analogy, made precise
A microagent is a small agent with a single responsibility, a context containing only what that responsibility requires, a toolset scoped to the job, and an explicit contract: a structured task goes in, a structured result comes out. A microagent architecture is a set of these coordinated by an orchestration layer, which may itself be an agent.

A confession before the mappings: this essay coins a term, builds a framework around it, and will later warn you about adopting frameworks because they sound good. I am doing the thing I am warning about. The only defense is the test I would apply to anyone else’s coinage — the mappings have to hold under pressure, and the analogy has to fail somewhere instructive. Most of what follows is that test.
Start with the most important mapping: the context window is the bounded context. In a service architecture, the unit of isolation is the process and its data store. In an agent architecture, it is the context window — the finite span of tokens within which the model can actually attend to things. Nearly everything that goes wrong inside a monolithic agent is contention for that span.
The rest of the correspondences follow. A scoped toolset is a narrow API surface. A per-agent prompt is an independently versioned codebase: you can rewrite the code-review agent’s instructions without touching the migration agent’s, and you can run regression evals against it in isolation, which gives you a unit test where previously your only option was the integration test. Per-agent model selection is the polyglot stack: a fast, cheap model for triage and extraction, a frontier model for synthesis and judgment, each priced to its job. Fanning subtasks out to parallel workers is horizontal scaling. Even the protocol layer arrived on cue. MCP did for tool access what HTTP and JSON did for service integration, which is to say it made the plumbing boring, and agent-to-agent protocols like A2A are reaching for the same standardization between the agents themselves.
Why the monolithic agent breaks
Monolithic services mostly failed for organizational reasons: the release train, the merge queue, eight teams coupled through one deployment. Monolithic agents fail for a more physical reason. Attention is finite, and it degrades.
The first pressure is context degradation. Models reason worse as contexts grow; the research literature calls one flavor of this “lost in the middle,” and practitioners call the broader phenomenon context rot. Worse than raw length is pollution. By the time the review agent finally turns to the diff, the window holds the full text of thirty files it explored, four hundred lines of Jest output, and two Jira tickets, all of it competing with the instructions that actually matter. A monolithic agent carries its entire history everywhere, and most of that history is debris.
The second is tool confusion. Selection accuracy falls as the tool count rises, especially when descriptions overlap. An agent with six tools picks the right one almost every time. Give it forty, several of which are different flavors of “search,” and it starts grabbing the wrong one, then burns turns recovering. Ask the review agent whether a new helper duplicates an existing utility and watch it try search_jira, then web_search, before it thinks to grep the repo.
The third is prompt sprawl. Every instruction in a monolithic prompt is global. The paragraph you added to stop the agent flagging snake_case in the Python services now suppresses naming feedback in the TypeScript ones. There is no modularity, so there is no unit testing, only the end-to-end run, which is slow, expensive, and statistical. Folks learn to fear touching the prompt, which is the same learned helplessness the monolith taught, with the same result: the artifact ossifies.
The fourth is economics. One agent means one model for every step. You pay frontier prices to check that a changelog entry exists because the same context has to reason about a race condition three files later.
Underneath all of these sits the simplest constraint: one context is one thread of attention. There is no parallelism inside a monolithic agent. A breadth task, say surveying twelve services or checking thirty files, happens strictly in sequence while the debris piles up.
The patterns in the wild
None of this is hypothetical. The patterns are in production, and most working engineers have already touched at least one without naming it.
The flagship is orchestrator-worker. A lead agent decomposes the task, dispatches subagents in parallel, and synthesizes their results. Anthropic’s research feature is the best-documented example: a lead agent plans, spawns parallel search subagents that each burn a private context iterating over sources, and synthesizes what comes back. Their engineering write-up reported that this arrangement, with a frontier model leading and lighter models working, beat a single frontier-model agent by roughly ninety percent on their internal research eval. Their illustration of why is worth repeating: asked to identify every board member across the S&P 500’s information technology companies, the multi-agent system decomposed the list and fanned it out, while the single agent ground away at sequential searches and never produced the answer. The same analysis found that token spend explained most of the performance variance, and the productive way to spend more tokens turned out to be more contexts rather than longer ones. Hold on to that thought; the bill comes due a few sections down.
The write-up is just as useful on what went wrong before it went right. Early versions spawned fifty subagents for simple queries and scoured the web for sources that did not exist, and the fix was not cleverer architecture but explicit scaling rules written into the lead agent’s prompt: one agent and three to ten tool calls for simple fact-finding, two to four subagents for direct comparisons, ten or more only for genuinely complex research. Note what that is. It is a decision rubric for when to decompose, running in production, and we will want it again at the end.
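The write-up kept these rules in the lead agent's prompt. Purely as an illustration, the same rubric can be written as a small table that an orchestration layer checks outside the model. The enforcement point, the tier labels, and the function name below are assumptions of this sketch, not something the write-up describes.

```python
from typing import NamedTuple, Optional


class Budget(NamedTuple):
    min_agents: int
    max_agents: Optional[int]  # None: the quoted rule states no upper bound


# Tiers paraphrase the scaling rules quoted above; the keys are hypothetical labels.
SCALING_RULES = {
    "simple_fact_finding": Budget(1, 1),     # one agent, three to ten tool calls
    "direct_comparison": Budget(2, 4),       # two to four subagents
    "complex_research": Budget(10, None),    # ten or more
}


def clamp_subagents(task_kind: str, requested: int) -> int:
    """Pull a requested subagent count back inside the tier's budget."""
    budget = SCALING_RULES[task_kind]
    capped = requested if budget.max_agents is None else min(requested, budget.max_agents)
    return max(budget.min_agents, capped)


print(clamp_subagents("simple_fact_finding", 50))  # the fifty-subagent failure, clamped
```

Whether the rule lives in a prompt or in code, the useful property is the same: the decision to decompose is written down where it can be reviewed and changed.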
The everyday version is Claude Code’s subagent system. You define agents as files, each with its own system prompt, its own tool allowlist, optionally its own model, and the main session delegates to them. The subagent does its forty file reads and dead-end greps inside a disposable context; what crosses back into the main thread is a summary. The main context stays clean enough to keep making good decisions, which is the entire point.
Here is what decomposing the review agent actually looks like. One of its three replacements, complete, a markdown file dropped into .claude/agents/:
---
name: security-reviewer
description: Reviews pull request diffs for security issues. Use on any PR touching auth, input handling, secrets, or dependencies.
tools: Read, Grep, Glob
model: sonnet
---
You review pull request diffs for security problems only: injection,
authorization gaps, secrets in code, unsafe deserialization, risky
dependency changes. Read the diff and any file it touches. Do not
comment on style, naming, or test coverage; other agents own those.
Report each finding as file, line, severity, and a one-sentence
rationale. End with a list of anything you could not verify.
Two more like it, test-coverage-reviewer and conventions-reviewer, and the main session stops being a reviewer and becomes an orchestrator: it reads the diff once, dispatches all three in parallel, and synthesizes their findings into one review. The conventions reviewer runs on the smallest model available, because you don’t need a frontier model to notice a missing changelog entry. And look at the frontmatter, because the frontmatter is the architecture. tools is the scoped API surface. model is the polyglot stack. The file sitting in version control is the independently versioned, independently testable codebase. Every mapping from two sections ago is right there in a dozen lines of YAML and prose.
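The orchestrator's side of that arrangement has a simple shape: dispatch, wait, merge. The sketch below shows only that shape. It is not how Claude Code delegates internally, and run_subagent is a hypothetical stand-in that returns a canned result instead of calling a model.

```python
from concurrent.futures import ThreadPoolExecutor

REVIEWERS = ["security-reviewer", "test-coverage-reviewer", "conventions-reviewer"]


def run_subagent(name: str, diff: str) -> dict:
    """Hypothetical stand-in for a real subagent call; returns an empty report."""
    return {"agent": name, "findings": [], "not_checked": []}


def review(diff: str) -> dict:
    """Read the diff once, fan it out to every reviewer in parallel, merge the reports."""
    with ThreadPoolExecutor(max_workers=len(REVIEWERS)) as pool:
        # pool.map preserves input order, so results line up with REVIEWERS.
        results = list(pool.map(lambda name: run_subagent(name, diff), REVIEWERS))
    return {
        "agents": [r["agent"] for r in results],
        "findings": [f for r in results for f in r["findings"]],
        "not_checked": [n for r in results for n in r["not_checked"]],
    }


print(review("PR-4112.diff")["agents"])
```

The merge step here is plain concatenation. In a real system the synthesis is itself a model call, which is where judgment about conflicting findings would live.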
Around the flagship sits a supporting cast with obvious service-era ancestors. The router is the API gateway: a cheap classifier reads the request and dispatches to a specialist. The pipeline is the ETL job, each agent transforming a structured input into a structured output and passing it along; this is closer to a workflow than an agent, and usually better for it. The critic loop pairs a generator with an evaluator, which is code review promoted to architecture. Hierarchies, agents spawning agents, are org charts, with everything that implies.
Where the analogy breaks
An analogy earns its keep where it fails, and this one fails in both directions.
The failure that flatters agents first. Decomposing a service traded runtime performance for organizational velocity; a network call is never faster than a function call, and you accepted the latency to buy team autonomy. Decomposing an agent can improve the output itself, because a narrow, clean context reasons better than a bloated one. The subagent that knows only its task will often beat the omniscient agent dragging six hundred lines of irrelevant scrollback. In services, isolation bought maintainability. In agents, isolation buys capability. That is a better deal than microservices ever offered.
Now the failures that do not flatter. When service A calls service B, the payload arrives intact, byte for byte. When an orchestrator hands work to a subagent, it writes a brief, and when the subagent reports back, it writes a summary. Both are lossy compressions performed by a language model with no way to know which details will matter later. Watch the bug happen in the decomposed review agent. The orchestrator knows, from three turns of conversation, that this repo does not use JWTs; auth runs through a custom session middleware in src/auth/session.ts. The brief it writes says only “review the auth-related changes for security issues.” The security subagent greps for jwt and oauth, finds nothing alarming, and returns a clean report. The posted review says no security concerns. No exception was thrown, no tool failed, every agent did its job. The constraint the orchestrator knew but never wrote down, the anomaly the subagent noticed but left out of its summary — this is the new class of integration bug. Multi-agent systems play telephone by design.
Anthropic’s write-up names the failure exactly that, “the game of telephone,” and its appendix offers a mitigation: have subagents write their full output to the filesystem and pass back lightweight references, so the synthesizer can consult the source instead of the summary. It works, and notice the price. The fix for lossy handoffs is more shared state, which is the fourth failure on this list. Contract discipline, which was good practice in services, is survival here.
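A minimal sketch of that mitigation, assuming a shared working directory and a JSON report format (both are assumptions of this example, as are the function names): the subagent persists its full output and hands back only a short summary plus a reference the synthesizer can follow.

```python
import json
import tempfile
from pathlib import Path


def write_report(workdir: Path, agent: str, full_output: dict) -> dict:
    """Persist the subagent's full output; return a lightweight reference to it."""
    path = workdir / f"{agent}.json"
    path.write_text(json.dumps(full_output, indent=2))
    summary = f"{len(full_output['findings'])} finding(s)"
    return {"agent": agent, "summary": summary, "ref": str(path)}


def load_report(reference: dict) -> dict:
    """Let the synthesizer consult the source instead of the summary."""
    return json.loads(Path(reference["ref"]).read_text())


workdir = Path(tempfile.mkdtemp())
full = {"findings": [{"file": "src/auth/session.ts", "line": 112}], "notes": "full detail"}
ref = write_report(workdir, "security-reviewer", full)
print(ref["summary"], "->", load_report(ref)["notes"])
```

The reference is cheap to pass around, but the directory it points into is now state that every agent can read and write, which is exactly the trade the paragraph above describes.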
Second, the components themselves are probabilistic. Microservices composed deterministic code over an unreliable network, and we built circuit breakers for the network. Microagents compose unreliable components over a reliable network. Chain five steps that each succeed 95 percent of the time and the pipeline runs at about 77. The mitigation has math too: add a validator and one retry per step, and per-step reliability rises to 99.75 percent, putting the pipeline back around 98.8. That assumes the validator catches every failure, and validators miss things too — so treat 98.8 as a ceiling. Validators and retries are the new circuit breakers, but every retry doubles the spend on that step, which feeds the next problem, and the floor stays lower than code: every added hop adds failure surface in a way that adding a service call did not.
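The arithmetic is short enough to check directly. The sketch assumes that step failures are independent and, as noted above, that the validator catches every failure; both assumptions are optimistic.

```python
def pipeline_success(step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return step_success ** steps


def with_one_retry(step_success: float) -> float:
    """Per-step success when a perfect validator triggers one retry on failure."""
    return 1 - (1 - step_success) ** 2


baseline = pipeline_success(0.95, 5)
retried_step = with_one_retry(0.95)
retried = pipeline_success(retried_step, 5)

print(f"{baseline:.3f} {retried_step:.4f} {retried:.3f}")  # 0.774 0.9975 0.988
```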
Third, the meter is always running. Every boundary crossing costs tokens and latency, and shared background has to be paid into each new context separately. Put numbers on it. A 3,000-token brief, the diff summary plus repo conventions plus the task description, fanned out to three review subagents is 9,000 input tokens spent before any agent has read a file, and every follow-up wave repeats the charge. Each subagent then pulls what it needs, call it 40,000 tokens of diff, files, and grep results apiece, for 120,000 more. The monolith is not innocent in this comparison: it re-sends its own swollen window on every turn, and prompt caching discounts that cost rather than eliminating it. But the decomposed system pays the duplication and the fan-out on top. Anthropic measured the net effect for their research system at around fifteen times the tokens of an ordinary chat session. What keeps that from being ruinous is mostly which model does the reading. At current list prices, Anthropic’s small model takes input at $1 per million tokens and its frontier tier at $5, so routing the bulk reads through subagents on the small model claws back much of the multiplier. Per-agent model selection is the load-bearing economic mapping, not a nice-to-have. The quality gain was real and so was the bill, and they said plainly that the architecture only makes sense for tasks whose value clears it.
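To make the model-selection point concrete, here is the same fan-out priced at both tiers. It uses only the figures quoted above and counts input tokens for a single wave; output tokens, follow-up waves, and caching discounts are left out, so treat the dollar amounts as illustrative rather than a real bill.

```python
BRIEF_TOKENS = 3_000            # brief sent to each subagent
READ_TOKENS_PER_AGENT = 40_000  # diff, files, and grep results each one pulls
SUBAGENTS = 3

SMALL_USD_PER_MTOK = 1.00       # input prices quoted in the text
FRONTIER_USD_PER_MTOK = 5.00


def fanout_input_tokens(brief: int, reads: int, agents: int) -> int:
    """Input tokens for one wave: every subagent pays the brief plus its own reads."""
    return agents * (brief + reads)


def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    return tokens / 1_000_000 * usd_per_million


total = fanout_input_tokens(BRIEF_TOKENS, READ_TOKENS_PER_AGENT, SUBAGENTS)
print(total)                                                # 129000
print(round(input_cost_usd(total, SMALL_USD_PER_MTOK), 3))     # 0.129
print(round(input_cost_usd(total, FRONTIER_USD_PER_MTOK), 3))  # 0.645
```

The token count is identical in both rows; only the tier doing the reading changes, and it moves the input bill by a factor of five.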
Fourth, shared state has no transactions. The filesystem, usually a repo, becomes the shared database, and two agents writing to it get no isolation levels and no locks unless you build them. Two coding agents editing the same file is a write conflict that nothing in the architecture detects for you.
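One inexpensive way to at least detect that conflict is an optimistic check: each agent remembers a fingerprint of the file as it read it and refuses to write if the file has changed since. The names below are hypothetical, and the check-then-write is not atomic, so this is conflict detection under light contention, not a lock and not an isolation level.

```python
import hashlib
import tempfile
from pathlib import Path


class WriteConflict(Exception):
    """Raised when a file changed between an agent's read and its write."""


def fingerprint(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_if_unchanged(path: Path, seen: str, new_text: str) -> None:
    """Write only if the file still matches the fingerprint the agent read."""
    if fingerprint(path) != seen:
        raise WriteConflict(f"{path.name} changed since it was read")
    path.write_text(new_text)


target = Path(tempfile.mkdtemp()) / "session.ts"
target.write_text("original")
seen_by_a = fingerprint(target)
seen_by_b = fingerprint(target)

write_if_unchanged(target, seen_by_a, "edit from agent A")
try:
    write_if_unchanged(target, seen_by_b, "edit from agent B")
except WriteConflict as err:
    print("conflict:", err)
```

Giving each agent its own branch or worktree and merging afterwards is the heavier alternative; either way, the detection has to be something you build.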
And one limit that is about fit rather than physics: tightly coupled work resists partition. A task where every decision needs the whole picture, which describes most prose and most single coherent code changes, gets worse when you shard the picture across contexts. The review agent marks the boundary precisely: the review fans out, three independent readings of one diff, but the fix it recommends does not, because a coherent change across four files needs one context holding all four. Research parallelizes. Judgment mostly does not.
The cautionary tale is the point
The most useful thing microservices can teach agent builders is not the boom. It is the backlash.
Start with the distributed monolith, the canonical anti-pattern: services nominally separate but coupled through shared databases and synchronized deploys, all of the overhead with none of the autonomy. Its agent twin is already common. A “crew” of five agents that share full conversation history, where each agent needs to know everything the others know, is one agent with extra latency and five bills. The smell is visible in the config: five agent definitions, one shared-memory flag, and the same thirty tools pasted into each. If you cannot state what an agent does not need to know, you have not designed a boundary; you have drawn a line through a prompt. Call it the distributed prompt.
Premature decomposition transfers almost word for word. Martin Fowler’s monolith-first advice, that you should not begin with microservices even when you are confident the system will eventually justify them, needs no translation at all. Anthropic’s own guidance on building agents opens the same way: find the simplest solution that works, and add complexity only when it demonstrably earns it. Most problems that look like they need a team of agents are workflows wearing a costume.
Some folks argue the stronger position outright. Cognition’s “Don’t Build Multi-Agents” essay makes the case that agents should share full context and that subagent coordination fails along exactly the lossy-handoff lines above, and for tightly coupled work I think they are right. The question is how much of your workload is actually tightly coupled, and the answer can be found in your traces. There is no single doctrine that covers all workloads, just as there is no single architectural pattern that covers all codebases.
The pendulum swing is predictable because we have already watched it once. In 2023 a Prime Video team published a write-up describing how it collapsed a distributed serverless monitoring pipeline back into a monolith and cut infrastructure costs by roughly ninety percent, to considerable industry schadenfreude. The agent version of that post is being drafted somewhere right now: a team that replaced its five-agent crew with one well-written prompt and watched cost, latency, and quality all improve. Decomposition is a response to measured pressure. It is not a starting posture, and it is definitely not an aesthetic. Borrowing the topology without the pressure is cargo-culting.
When to split, and what to build first
The signals that you have outgrown a single agent are measurements, not vibes. One warning first: this rubric is simple enough that it is easy to misapply, and the usual misapplication is splitting because one signal flickered once. Hold it loosely.
- Context overflow on real workloads: tasks failing because the relevant fact scrolled out of the window or drowned in debris. The measurement: pull your failed traces and check whether the fact the agent needed was in the window when it answered. Absent means overflow; present but unused means degradation; either way you get the context size where your failure curve bends.
- Measured tool confusion: wrong-tool selections occurring at a rate that moves your error budget. The measurement: sample fifty traces and count how often the first tool call was the wrong tool.
- Parallelizable breadth: the task decomposes into independent reads, like research, survey, or review fan-out, and wall-clock time matters. The test is structural: list the reads, mark which ones depend on another’s output, and time the sequential run.
- Capability mismatch: a step that succeeds on a model a fifth the price, trapped in a context that forces the expensive one. The measurement: replay a sample of that step on the small model and diff the outcomes.
- Ownership: two teams need to iterate on two behaviors without retesting each other’s work. No measurement required: the third time a prompt edit causes a cross-team regression, the pattern is established, and the teams involved should plan how to pull the behaviors apart.
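The tool-confusion measurement in the list above is only a few lines once traces are labeled. This sketch assumes each sampled trace has been reduced to the first tool the agent called and the tool a reviewer judged correct; the record shape and field names are hypothetical, and the tool labels echo the earlier example.

```python
from typing import Dict, List


def first_call_error_rate(traces: List[Dict[str, str]]) -> float:
    """Share of sampled traces whose first tool call was the wrong tool."""
    if not traces:
        raise ValueError("no traces sampled")
    wrong = sum(1 for t in traces if t["first_tool"] != t["expected_tool"])
    return wrong / len(traces)


# Hand-labeled sample; in practice this would be the fifty traces from the rubric.
sample = [
    {"first_tool": "search_jira", "expected_tool": "grep"},
    {"first_tool": "grep", "expected_tool": "grep"},
    {"first_tool": "web_search", "expected_tool": "grep"},
    {"first_tool": "grep", "expected_tool": "grep"},
]

print(first_call_error_rate(sample))  # 0.5
```

The labeling is the expensive part, and it is a human judgment. The number means something only if the same person, or the same rubric, labels the sample before and after any split.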
When you do split, build the disciplines before the agents.
- Contracts: structured task in, structured output out, avoiding transcript dumps where possible, with the brief written and versioned like code, because the orchestrator’s instructions are now a critical contributor to your system’s accuracy. If defining this contract is hard, that’s a smell that you have a poorly defined task for the new microagent.
- Traces: spans across every agent call, because you cannot debug a system you cannot replay, and “the subagent did something weird” is not a bug report.
- Evals per agent: a set of test cases for each agent — sample inputs and the answers you expect back. For the security reviewer, that’s a handful of diffs whose findings you already know. Run them on every prompt change, exactly as you would unit test a service.
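A per-agent eval can start very small. The sketch below assumes a finding can be identified by file and line, and it uses a hypothetical stand-in for the agent call. It scores a case as passed when every expected finding was reported, so it measures misses only; penalizing false positives would need a second check.

```python
from typing import Callable, List, Set, Tuple

Finding = Tuple[str, int]             # (file, line)
EvalCase = Tuple[str, Set[Finding]]   # (diff text, findings you already know)


def pass_rate(agent: Callable[[str], Set[Finding]], cases: List[EvalCase]) -> float:
    """Fraction of cases where the agent reported every expected finding."""
    if not cases:
        raise ValueError("no eval cases")
    passed = sum(1 for diff, expected in cases if expected <= agent(diff))
    return passed / len(cases)


def fake_security_reviewer(diff: str) -> Set[Finding]:
    """Hypothetical stand-in for the real agent; flags any diff mentioning a token."""
    return {("src/auth/session.ts", 112)} if "token" in diff else set()


CASES: List[EvalCase] = [
    ("logger.debug(session.token)", {("src/auth/session.ts", 112)}),
    ("rename a helper function", set()),
]

print(pass_rate(fake_security_reviewer, CASES))  # 1.0
```

Because the agent is statistical, one run is a sample. Running each case several times and tracking the rate across prompt versions is closer to what "unit test" can mean here.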
Here is what a contract looks like for the security reviewer. The first half is the brief, the task going in:
{
  "task": "security-review",
  "diff": "PR-4112.diff",
  "scope": ["src/auth/**", "src/api/middleware/**"],
  "must_consider": ["auth uses custom session middleware in src/auth/session.ts, not JWTs"],
  "out_of_scope": ["style", "naming", "test coverage"]
}
The second half is the result coming back, in the shape the agent definition promised:
{
  "task": "security-review",
  "findings": [
    {
      "file": "src/auth/session.ts",
      "line": 112,
      "severity": "high",
      "rationale": "Refactored validate() logs the full session token at debug level; tokens in logs outlive the session."
    }
  ],
  "not_checked": ["whether the session store enforces TTL; depends on Redis config not present in the diff"]
}
The must_consider field exists because of the session-middleware bug two sections back; the brief is where the orchestrator’s knowledge either crosses the boundary or dies, and the finding above exists only because it crossed. The not_checked field exists because a subagent’s silence is not evidence, and the synthesizer should never be allowed to mistake silence for a clean result; it is the difference between a report that says “clean” and one that says “clean as far as I could see.” A team unwilling to build these three disciplines is not ready to split. It is just distributing its prompt.
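Enforcing the result half of that contract takes little code. The sketch below checks the fields from the example above and derives a verdict that refuses to read silence as clean. The function names and verdict strings are assumptions of this example, not part of any framework.

```python
from typing import List

REQUIRED_FINDING_KEYS = {"file", "line", "severity", "rationale"}


def validate_result(result: dict) -> List[str]:
    """Return a list of contract violations; an empty list means the shape is valid."""
    problems = []
    for key in ("findings", "not_checked"):
        if key not in result:
            problems.append(f"missing {key}")
    for i, finding in enumerate(result.get("findings", [])):
        missing = REQUIRED_FINDING_KEYS - set(finding)
        if missing:
            problems.append(f"finding {i} missing {sorted(missing)}")
    return problems


def verdict(result: dict) -> str:
    """Summarize a result without mistaking an unverified area for a clean one."""
    if validate_result(result):
        return "invalid"
    if result["findings"]:
        return "findings"
    return "clean" if not result["not_checked"] else "clean as far as checked"


print(verdict({"findings": [], "not_checked": ["session store TTL"]}))
print(verdict({"findings": []}))
```

A result that omits not_checked is rejected instead of being read as clean, which is the synthesizer rule from the paragraph above written as a guard.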
Context is the new coupling
Architecture has always been the management of a scarce resource. For the monolith era the scarce resources were the release train and the coupling between teams, and microservices spent latency and operational complexity to buy autonomy back. For agent systems the scarce resource is attention — the model’s finite and degradable focus — and the unit of isolation that protects it is the context window.
The lasting contribution of microservices was never the topology. Plenty of teams shipped beautifully on monoliths, and plenty drowned in service meshes. The contribution was the discipline: explicit contracts, observability as a precondition, decomposition under pressure rather than by default. Microagent architecture deserves exactly that treatment, adopted for the same reasons and resisted with the same skepticism. Take the discipline. Make the topology earn itself.
And if you want one concrete place to start: the next time you’re tempted to split an agent, write the brief you’d hand the subagent — scope, must-considers, the shape of the reply — before you build anything else. Sometimes finishing the brief is how you discover you were one good prompt away all along. When it isn’t, you’ve already written the first artifact of the new architecture.