Would you let the same engineer code, test, review, and deploy alone?
Anthropic's new article made me more certain that AI agents need role separation too. A lot of lessons from human teams repeat almost exactly.
TL;DR
>When one AI agent is asked to plan, generate, review, and deploy, it often recreates the same coordination problems human teams already learned the hard way.
>Role separation matters because AI systems can compound mistakes much faster than human workflows.
>The real point of this article is orchestration and responsibility design, not multi-agent hype.
Would you let the same engineer code, test, review, and deploy alone?

Probably not. But that is basically what we keep asking AI to do.
Anthropic recently published a technical write-up on how they used Claude to build a full application from scratch. What stood out to me was the conclusion: work that looked weak when handled by one AI started looking much better once it was split across three.
That is basically team design logic.
1. The problem with one agent doing everything
Their first instinct was obvious: let one AI do the whole thing. Planning, coding, testing, bug fixing, all in one loop.
Two problems showed up quickly:
- once the context got too full, the model started rushing to finish and quality dropped
- when the model reviewed its own output, it rated itself too generously
Anthropic calls this inflated self-assessment. In plainer language: the AI gets overconfident about its own work.
2. Borrowing the idea from GANs
The fix was to separate the part that builds from the part that criticizes.
Their final system used three agents:
- Planner: more like a PM, turning vague requests into a clear spec
- Generator: more like an engineer, implementing features in sprints
- Evaluator: more like QA, using Playwright to actually operate the app and find real problems
The Generator and Evaluator also agree on a sprint contract before work begins. If that sounds familiar, it is because it is very close to sprint planning.
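The article doesn't publish their orchestration code, but the shape of the loop is easy to sketch. Below is a minimal, hypothetical Python version where each role is a stubbed function (in a real system, each would be a separate model call with its own prompt). The key property is structural: the Evaluator checks the artifact against a spec it did not write and did not implement, and the Generator only sees the Evaluator's feedback, never its own verdict.

```python
from dataclasses import dataclass

@dataclass
class SprintResult:
    artifact: str
    approved: bool
    feedback: str

def plan(request: str) -> list[str]:
    # Planner role: turn a vague request into concrete spec items.
    # Stubbed here; in practice this would be a model call with a PM-style prompt.
    return [f"spec item: {request}"]

def generate(spec: list[str], feedback: str = "") -> str:
    # Generator role: produce an implementation attempt from the spec,
    # incorporating Evaluator feedback from the previous round if any.
    revised = f" (revised after: {feedback})" if feedback else ""
    return "implementation of " + "; ".join(spec) + revised

def evaluate(artifact: str, spec: list[str]) -> tuple[bool, str]:
    # Evaluator role: check the artifact against the agreed spec.
    # Crucially, it never grades work it produced itself.
    missing = [item for item in spec if item not in artifact]
    return (not missing, f"missing: {missing}" if missing else "ok")

def run_sprint(request: str, max_rounds: int = 3) -> SprintResult:
    spec = plan(request)  # the "sprint contract", agreed before work begins
    feedback = ""
    artifact = ""
    for _ in range(max_rounds):
        artifact = generate(spec, feedback)
        approved, feedback = evaluate(artifact, spec)
        if approved:
            return SprintResult(artifact, True, feedback)
    return SprintResult(artifact, False, feedback)
```

This is a sketch under obvious simplifications (string matching instead of a real QA pass, no Playwright, one round-trip channel for feedback), but it shows why the design resists inflated self-assessment: approval is computed by a function that had no hand in producing the artifact.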
3. The numbers are hard to argue with
They compared the same task in two ways:
- one AI: 20 minutes, $9, but the core feature was broken
- three AIs: 6 hours, $200, but the product actually worked
Yes, the cost is much higher. But the difference is also basically "usable" versus "not usable." In another example, building a DAW (digital audio workstation), it was the Evaluator that caught the bug, not the Generator reviewing its own work.
4. Anyone who has led a team will recognize this
From my experience leading product and engineering at dentall, a lot of human team-management instinct applies almost directly to AI agent design.
- the person writing code should not also be the only QA
- if the spec is fuzzy, execution usually breaks
- one person doing everything can look efficient, but quality gets unstable
These are not new lessons. AI is just replaying them.
5. You also have to remember to remove scaffolding
One of my favorite parts of the article came near the end.
They built a lot of extra scaffolding because Opus 4.5 had clear context-anxiety behavior. Once Opus 4.6 improved, some of that scaffolding no longer made sense, so they removed it.
That idea applies beyond AI. A lot of team processes and management rules exist because something broke in the past. But once the underlying condition changes, those extra layers may become unnecessary weight.
Closing note
This article pushed me further toward one belief: the way AI is organized is becoming a topic in its own right.
At some point the question stops being "is the model good enough?" and becomes "how are you arranging the work around it?"
PS
One line from the Anthropic piece really stayed with me: it is much easier to train an evaluator to be strict than to ask a generator to be honest about its own work.
Which is probably why companies have QA teams instead of letting engineers say "I tested it, looks fine."


