Last updated: April 24, 2026
Claude Code did not get dumber, it got managed worse
Anthropic acknowledged on April 23 that Claude Code quality has dropped. The real culprit was not the model — it was the system around the model: the harness.
TL;DR
Key takeaways first
>Anthropic acknowledged on April 23 that Claude Code quality has degraded over the past month — but the cause was not the model. It was three product-layer changes: reasoning effort, cache handling, and system prompt.
>All three changes were doing the same thing, making Claude Code faster, cheaper, and shorter, and they quietly traded away reasoning depth, context memory, and judgement quality.
>The harness — the full system around the model that lets it actually work — is becoming a baseline skill. Prompts are increasingly production code that needs version control, tests, and staged rollout.

Claude Code did not get dumber. It got managed worse.
Anthropic published a post-mortem worth reading on April 23, formally acknowledging that the experience of using Claude Code, the Claude Agent SDK, and Claude Cowork has degraded over the past month.
The interesting part is not "Claude broke."
The interesting part is what Anthropic actually said: the API and inference layer were fine. The model itself was not deliberately downgraded. The real problems came from three product-layer changes — reasoning effort, context / cache management, and system prompt.
In other words, this was not model regression.
It was the system around the model — the harness that lets it actually do work — that managed it into looking dumber.
1. Three optimisations were all cutting from the same place
Each of the three issues looks reasonable on its own.
On March 4, to lower latency, the default reasoning effort in Claude Code was changed from high to medium. Quality on complex coding tasks dropped, and the change was reverted on April 7.
On March 26, to keep idle sessions older than an hour from burning tokens again on resume, Anthropic shipped a cache optimisation. A bug made the system clear old thinking on every subsequent turn instead of only once. Claude started forgetting context, repeating itself, and picking the wrong tools. It was fixed on April 10.
On April 16, to reduce verbosity, the system prompt gained an instruction along the lines of "no more than 25 characters between tool calls, no more than 100 in a final response." Coding quality eval scores dropped, and the instruction was reverted on April 20.
Three different changes on the surface, but all doing the same thing underneath:
Make Claude Code faster, cheaper, shorter.
The problem is that in an AI agent, "cheaper" is rarely free. Sometimes what you cut really is just latency. Sometimes it is reasoning depth, context memory, or judgement quality.
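To make the cache bug concrete, here is a minimal hypothetical sketch of the failure pattern. This is not Anthropic's actual code; every name is invented. The intent is to prune stale thinking once, on the first turn after a long idle period, but losing the guard between turns means every turn throws away the reasoning the previous turn just produced.

```python
# Hypothetical sketch of the cache-bug pattern described above.
# Not Anthropic's code: names and structure are invented for illustration.

STALE_AFTER_SECONDS = 60 * 60  # sessions idle longer than an hour count as "stale"

def prune_stale_thinking(messages):
    """Drop old 'thinking' blocks so a resumed session does not re-pay for them."""
    return [m for m in messages if m.get("type") != "thinking"]

def build_turn(messages, idle_seconds, already_pruned):
    # Intended behavior: prune once, on the first turn after a long idle period.
    if idle_seconds > STALE_AFTER_SECONDS and not already_pruned:
        messages = prune_stale_thinking(messages)
        already_pruned = True

    # Buggy behavior (the pattern the post-mortem describes): the "already_pruned"
    # guard is lost between turns, so every turn strips the thinking the previous
    # turn just produced, and the agent keeps forgetting why it acted.
    #   if idle_seconds > STALE_AFTER_SECONDS:
    #       messages = prune_stale_thinking(messages)

    return messages, already_pruned
```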
2. There is no "intelligence" column on the dashboard
The reason this is worth attention from product and engineering leaders is that it looks a lot like a KPI management problem.
A CTO dashboard usually has latency, token cost, request volume, error rate, availability.
It rarely has a column called "intelligence."
- Latency down 5 % — visible
- Cost down 10 % — visible
- Output length down 30 % — visible
- Judgement down 3 % — not necessarily visible
You will probably only notice it days or weeks later, slowly, through engineer complaints, support tickets, Reddit, X, HN: "something feels off lately."
This is exactly why AI products get pulled around by the metrics that are easy to see.
In traditional software, we already know this dynamic. If a team only watches story points, they start doing low-value work. If they only watch ticket close rate, they start sacrificing actual problem solving.
It is the same with AI agents.
If you only watch latency, tokens, and verbosity, you may quietly grind down the most important thing — and the hardest thing to measure: depth of reasoning.
3. LLM bugs live inside distributions
The most striking thing in the post-mortem was not the three bugs themselves. It was Anthropic admitting that internal use and eval did not reproduce the issues at first.
And one of the cache bugs passed multiple layers of human code review, automated code review, unit tests, E2E tests, automated verification, and dogfooding — and still slipped through for a week.
That is worth sitting with.
In traditional QA, a bug is often boolean.
Input A should produce output B. If output B is missing, it is broken.
LLM product bugs are usually not like that.
They look more like distribution shift.
For input A, the model usually still produces something that looks like B. But the average quality dropped a little. The reasoning got a bit shallower. The context memory weakened a touch. The tool selection got slightly off.
The hardest part is that this kind of bug does not raise an error.
It disguises itself as "still kind of works, just feels dumb lately."
This is a reminder for any team building AI products: traditional testing still matters, but it is not enough. You also need eval that can measure distribution. You need regression tests against real workflows. And you need to treat power users' lived experience as an early warning system, not as internet noise.
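What "eval that can measure distribution" can look like in practice, as a rough sketch: run the same real-workflow task suite repeatedly under the baseline and the candidate configuration, then compare the two score distributions instead of checking a single run. One simple option is a rank test like SciPy's mannwhitneyu; the scores and thresholds below are made up, and the point is the shape of the check, not the specific numbers.

```python
# Hedged sketch: compare score *distributions*, not a single pass/fail run.
from statistics import mean
from scipy.stats import mannwhitneyu  # one-sided rank test: "candidate scored lower"

def distribution_regression(baseline_scores, candidate_scores,
                            max_mean_drop=0.02, alpha=0.05):
    """Flag a regression if the candidate's scores are meaningfully lower than
    the baseline's, even though every individual run still 'works'."""
    mean_drop = mean(baseline_scores) - mean(candidate_scores)
    # Tests whether the candidate's score distribution sits below the baseline's.
    _, p_value = mannwhitneyu(candidate_scores, baseline_scores, alternative="less")
    return mean_drop > max_mean_drop and p_value < alpha

# Usage: run the same real-workflow suite N times under each configuration,
# then gate the rollout on the comparison rather than on one green run.
baseline = [0.91, 0.88, 0.90, 0.93, 0.89, 0.92, 0.90, 0.91]
candidate = [0.88, 0.86, 0.89, 0.90, 0.85, 0.87, 0.88, 0.86]
if distribution_regression(baseline, candidate):
    print("Block the rollout: the quality distribution shifted down.")
```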
4. Prompts are production code
This event is also a reminder of something that bears repeating: a system prompt is no longer just a piece of text.
It is production code.
A seemingly harmless prompt change can cause quality regression. A default setting can change a user's trust in the entire model. A context pruning rule or cache optimisation can make an agent forget why it just made a decision.
When we change code, we have code review, unit tests, integration tests, staging, canary, rollback.
But many teams change prompts in the spirit of "I think this phrasing is clearer, let's just push it."
Once AI is in production, that gets dangerous.
- Prompts need version control
- Context pruning needs tests
- Default settings need staged rollouts
- Agent behavior needs regression eval
- Any change that can affect "intelligence" needs a soak period and a rollback plan
It sounds engineering-heavy, but this is the foundation of trust.
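A rough sketch of what the checklist above can look like in practice: the prompt lives in version control, a change has to clear a regression eval against a pinned baseline, and it rolls out in stages with a soak period and an explicit rollback trigger. Every path, name, and threshold here is illustrative, not a prescription.

```python
# Hedged sketch: a prompt change treated like a code change.
# File paths, names, and thresholds are illustrative only.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class PromptRollout:
    prompt_path: Path            # prompt text lives in git, not in a dashboard textbox
    baseline_eval_score: float   # pinned score of the current production prompt
    min_acceptable_score: float  # block the change if the regression eval falls below this
    canary_stages: list = field(default_factory=lambda: [0.01, 0.10, 0.50, 1.0])
    soak_hours_per_stage: int = 24  # watch quality signals before widening the rollout
    rollback_trigger: str = "eval regression or complaint spike at any stage"

def gate(change: PromptRollout, candidate_eval_score: float) -> bool:
    """Only let the prompt change reach the first canary stage if the
    regression eval holds up against the pinned baseline."""
    return candidate_eval_score >= change.min_acceptable_score

rollout = PromptRollout(
    prompt_path=Path("prompts/system/coding_agent.md"),
    baseline_eval_score=0.90,
    min_acceptable_score=0.88,
)
print(gate(rollout, candidate_eval_score=0.86))  # False: the "clearer phrasing" does not ship
```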
5. The harness will become a baseline skill
This event made me more sure of one thing: AI products will grow a new specialty.
Not training models. Not just writing prompts.
Designing and maintaining the AI harness.
Harness is a slightly nerdy word, but it just means the entire system around the model that lets it work:
prompt stack, tool routing, context compaction, cache policy, permission model, eval, rollout, telemetry, rollback.
When we built SaaS, we cared about database schema, auth, queues, logs, observability.
When we build AI agents, we will need an additional layer of behavior infrastructure:
- What can it see?
- How long does it remember?
- When should it think harder?
- When should it ask a human?
- How does it verify its own work?
- How does it fail safely under uncertainty?
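One way to make those questions operational, sketched loosely: write them down as an explicit, reviewable policy object instead of leaving them as implicit defaults scattered through the codebase. All the names and values below are invented for illustration.

```python
# Hedged sketch: the behavior-infrastructure questions above, written down as an
# explicit, versionable policy instead of implicit defaults. All names invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentBehaviorPolicy:
    visible_paths: tuple          # what can it see?
    context_retention_turns: int  # how long does it remember?
    escalate_reasoning_on: tuple  # when should it think harder?
    ask_human_when: tuple         # when should it ask a human?
    self_check: str               # how does it verify its own work?
    on_uncertainty: str           # how does it fail safely?

DEFAULT_POLICY = AgentBehaviorPolicy(
    visible_paths=("src/", "tests/"),
    context_retention_turns=200,
    escalate_reasoning_on=("failing tests", "multi-file refactor", "ambiguous spec"),
    ask_human_when=("destructive migration", "secrets or credentials involved"),
    self_check="run the test suite and diff behavior before declaring done",
    on_uncertainty="stop and report, never guess silently",
)
```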
These things are usually not the showy part of a demo.
Demos want a wow moment.
Production wants the same kind of task to still be done well next week.
The Claude Code situation is a clean demonstration: a wrong harness change can make a frontier model look like it is regressing. A good harness can keep that same model performing reliably inside real workflows.
Last week I wrote "The cheaper you try to be on tokens, the more they cost", about how users who only optimise for token spend often end up burning more money. This week Anthropic showed the reverse: an AI company that only optimises for faster, cheaper, shorter can quietly sacrifice the quality the user actually needs.
These are two faces of the same coin.
In the AI product world, "saving" is not a neutral verb.
It is always trading against reasoning depth, context, or stability.
Closing note
I actually think Anthropic publicly unpacking this is a positive signal.
A regression is a regression. Claude Code is something a lot of engineers depend on every day, and that kind of degradation directly damages trust.
But this post-mortem has value because it took a problem that many people only vaguely felt and turned it into a system problem that can be discussed, tested, and fixed.
In the early days, people watched model scores. Then people watched whether agents could complete tasks. Going forward, people will watch whether the entire AI work system is reliable, observable, and rollback-able.
As a product person and engineering leader, I am increasingly convinced: the point of adopting AI is not buying the strongest model.
It is designing a system that lets a strong model perform reliably.
Models will keep getting stronger, but what really decides production-grade quality will be the boring stuff: defaults, prompts, context, cache, eval, rollback.
The same fundamentals we already learned in software engineering, now to be re-applied to AI agents.
The model is not the product. The system is.
PS
The post-mortem itself is unusually honest, and worth a complete read for anyone building AI products. I suspect a lot of engineers' first reaction will be: so it was not me getting worse, it really did get worse. Huh.