Last updated: April 24, 2026
Claude Code did not get dumber, it got managed worse
Anthropic acknowledged on April 23 that Claude Code quality has dropped. The real culprit was not the model — it was the system around the model: the harness.
TL;DR
Key takeaways first
>Anthropic acknowledged on April 23 that Claude Code quality has degraded over the past month — but the cause was not the model. It was three product-layer changes: reasoning effort, cache handling, and system prompt.
>All three changes were doing the same thing, making Claude Code faster, cheaper, and shorter, and they quietly traded away reasoning depth, context memory, and judgement quality.
>The harness — the full system around the model that lets it actually work — is becoming a baseline skill. Prompts are increasingly production code that needs version control, tests, and staged rollout.

Claude Code did not get dumber. It got managed worse.
Anthropic published a post-mortem worth reading on April 23, formally acknowledging that the experience of using Claude Code, the Claude Agent SDK, and Claude Cowork has degraded over the past month.
The interesting part is not "Claude broke."
The interesting part is what Anthropic actually said: the API and inference layer were fine. The model itself was not deliberately downgraded. The real problems came from three product-layer changes — reasoning effort, context / cache management, and system prompt.
In other words, this was not model regression.
It was the system around the model — the harness that lets it actually do work — that managed it into looking dumber.
1. Three optimisations were all cutting from the same place
Each of the three issues looks reasonable on its own.
On March 4, to lower latency, the default reasoning effort in Claude Code was changed from high to medium. Quality on complex coding tasks dropped, and the change was reverted on April 7.
On March 26, to keep idle sessions older than an hour from burning tokens again on resume, Anthropic shipped a cache optimisation. A bug made the system clear old thinking on every subsequent turn instead of only once. Claude started forgetting context, repeating itself, and picking the wrong tools. It was fixed on April 10.
On April 16, to reduce verbosity, the system prompt gained an instruction along the lines of "no more than 25 characters between tool calls, no more than 100 in a final response." Coding quality eval scores dropped, and the instruction was reverted on April 20.
Three different changes on the surface, but all doing the same thing underneath:
Make Claude Code faster, cheaper, shorter.
The problem is that in an AI agent, "cheaper" is rarely free. Sometimes what you cut really is just latency. Sometimes it is reasoning depth, context memory, or judgement quality.
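To make the cache bug concrete, here is a minimal hypothetical sketch of the failure pattern. This is not Anthropic's actual code; every name is invented. The intent is to prune stale thinking once, on the first turn after a long idle period, but losing the guard between turns means every turn throws away the reasoning the previous turn just produced.

```python
# Hypothetical sketch of the cache-bug pattern described above.
# Not Anthropic's code: names and structure are invented for illustration.

STALE_AFTER_SECONDS = 60 * 60  # sessions idle longer than an hour count as "stale"

def prune_stale_thinking(messages):
    """Drop old 'thinking' blocks so a resumed session does not re-pay for them."""
    return [m for m in messages if m.get("type") != "thinking"]

def build_turn(messages, idle_seconds, already_pruned):
    # Intended behavior: prune once, on the first turn after a long idle period.
    if idle_seconds > STALE_AFTER_SECONDS and not already_pruned:
        messages = prune_stale_thinking(messages)
        already_pruned = True

    # Buggy behavior (the pattern the post-mortem describes): the "already_pruned"
    # guard is lost between turns, so every turn strips the thinking the previous
    # turn just produced, and the agent keeps forgetting why it acted.
    #   if idle_seconds > STALE_AFTER_SECONDS:
    #       messages = prune_stale_thinking(messages)

    return messages, already_pruned
```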
2. There is no "intelligence" column on the dashboard
The reason this is worth attention from product and engineering leaders is that it looks a lot like a KPI management problem.
A CTO dashboard usually has latency, token cost, request volume, error rate, availability.
It rarely has a column called "intelligence."
- Latency down 5 % — visible
- Cost down 10 % — visible
- Output length down 30 % — visible
- Judgement down 3 % — not necessarily visible
You will probably only notice it days or weeks later, slowly, through engineer complaints, support tickets, Reddit, X, HN: "something feels off lately."
This is exactly why AI products get pulled around by the metrics that are easy to see.
In traditional software, we already know this dynamic. If a team only watches story points, they start doing low-value work. If they only watch ticket close rate, they start sacrificing actual problem solving.
It is the same with AI agents.
If you only watch latency, tokens, and verbosity, you may quietly grind down the most important thing — and the hardest thing to measure: depth of reasoning.
3. LLM bugs live inside distributions
The most striking thing in the post-mortem was not the three bugs themselves. It was Anthropic admitting that internal use and eval did not reproduce the issues at first.
And one of the cache bugs passed multiple layers of human code review, automated code review, unit tests, E2E tests, automated verification, and dogfooding — and still slipped through for a week.
That is worth sitting with.
In traditional QA, a bug is often boolean.
Input A should produce output B. If output B is missing, it is broken.
LLM product bugs are usually not like that.
They look more like distribution shift.
For input A, the model usually still produces something that looks like B. But the average quality dropped a little. The reasoning got a bit shallower. The context memory weakened a touch. The tool selection got slightly off.
The hardest part is that this kind of bug does not raise an error.
It disguises itself as "still kind of works, just feels dumb lately."
This is a reminder for any team building AI products: traditional testing still matters, but it is not enough. You also need eval that can measure distribution. You need regression tests against real workflows. And you need to treat power users' lived experience as an early warning system, not as internet noise.
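What "eval that can measure distribution" can look like in practice, as a rough sketch: run the same real-workflow task suite repeatedly under the baseline and the candidate configuration, then compare the two score distributions instead of checking a single run. One simple option is a rank test like SciPy's mannwhitneyu; the scores and thresholds below are made up, and the point is the shape of the check, not the specific numbers.

```python
# Hedged sketch: compare score *distributions*, not a single pass/fail run.
from statistics import mean
from scipy.stats import mannwhitneyu  # one-sided rank test: "candidate scored lower"

def distribution_regression(baseline_scores, candidate_scores,
                            max_mean_drop=0.02, alpha=0.05):
    """Flag a regression if the candidate's scores are meaningfully lower than
    the baseline's, even though every individual run still 'works'."""
    mean_drop = mean(baseline_scores) - mean(candidate_scores)
    # Tests whether the candidate's score distribution sits below the baseline's.
    _, p_value = mannwhitneyu(candidate_scores, baseline_scores, alternative="less")
    return mean_drop > max_mean_drop and p_value < alpha

# Usage: run the same real-workflow suite N times under each configuration,
# then gate the rollout on the comparison rather than on one green run.
baseline = [0.91, 0.88, 0.90, 0.93, 0.89, 0.92, 0.90, 0.91]
candidate = [0.88, 0.86, 0.89, 0.90, 0.85, 0.87, 0.88, 0.86]
if distribution_regression(baseline, candidate):
    print("Block the rollout: the quality distribution shifted down.")
```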
4. Prompts are production code
This event is also a reminder of something that bears repeating: a system prompt is no longer just a piece of text.
It is production code.
A seemingly harmless prompt change can cause quality regression. A default setting can change a user's trust in the entire model. A context pruning rule or cache optimisation can make an agent forget why it just made a decision.
When we change code, we have code review, unit tests, integration tests, staging, canary, rollback.
But many teams change prompts in the spirit of "I think this phrasing is clearer, let's just push it."
Once AI is in production, that gets dangerous.
- Prompts need version control
- Context pruning needs tests
- Default settings need staged rollouts
- Agent behavior needs regression eval
- Any change that can affect "intelligence" needs a soak period and a rollback plan
It sounds engineering-heavy, but this is the foundation of trust.
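A rough sketch of what the checklist above can look like in practice: the prompt lives in version control, a change has to clear a regression eval against a pinned baseline, and it rolls out in stages with a soak period and an explicit rollback trigger. Every path, name, and threshold here is illustrative, not a prescription.

```python
# Hedged sketch: a prompt change treated like a code change.
# File paths, names, and thresholds are illustrative only.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class PromptRollout:
    prompt_path: Path            # prompt text lives in git, not in a dashboard textbox
    baseline_eval_score: float   # pinned score of the current production prompt
    min_acceptable_score: float  # block the change if the regression eval falls below this
    canary_stages: list = field(default_factory=lambda: [0.01, 0.10, 0.50, 1.0])
    soak_hours_per_stage: int = 24  # watch quality signals before widening the rollout
    rollback_trigger: str = "eval regression or complaint spike at any stage"

def gate(change: PromptRollout, candidate_eval_score: float) -> bool:
    """Only let the prompt change reach the first canary stage if the
    regression eval holds up against the pinned baseline."""
    return candidate_eval_score >= change.min_acceptable_score

rollout = PromptRollout(
    prompt_path=Path("prompts/system/coding_agent.md"),
    baseline_eval_score=0.90,
    min_acceptable_score=0.88,
)
print(gate(rollout, candidate_eval_score=0.86))  # False: the "clearer phrasing" does not ship
```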
5. The harness will become a baseline skill
This event made me more sure of one thing: AI products will grow a new specialty.
Not training models. Not just writing prompts.
Designing and maintaining the AI harness.
Harness is a slightly nerdy word, but it just means the entire system around the model that lets it work:
prompt stack, tool routing, context compaction, cache policy, permission model, eval, rollout, telemetry, rollback.
When we built SaaS, we cared about database schema, auth, queues, logs, observability.
When we build AI agents, we will need an additional layer of behavior infrastructure:
- What can it see?
- How long does it remember?
- When should it think harder?
- When should it ask a human?
- How does it verify its own work?
- How does it fail safely under uncertainty?
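One way to make those questions operational, sketched loosely: write them down as an explicit, reviewable policy object instead of leaving them as implicit defaults scattered through the codebase. All the names and values below are invented for illustration.

```python
# Hedged sketch: the behavior-infrastructure questions above, written down as an
# explicit, versionable policy instead of implicit defaults. All names invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentBehaviorPolicy:
    visible_paths: tuple          # what can it see?
    context_retention_turns: int  # how long does it remember?
    escalate_reasoning_on: tuple  # when should it think harder?
    ask_human_when: tuple         # when should it ask a human?
    self_check: str               # how does it verify its own work?
    on_uncertainty: str           # how does it fail safely?

DEFAULT_POLICY = AgentBehaviorPolicy(
    visible_paths=("src/", "tests/"),
    context_retention_turns=200,
    escalate_reasoning_on=("failing tests", "multi-file refactor", "ambiguous spec"),
    ask_human_when=("destructive migration", "secrets or credentials involved"),
    self_check="run the test suite and diff behavior before declaring done",
    on_uncertainty="stop and report, never guess silently",
)
```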
These things are usually not the showy part of a demo.
Demos want a wow moment.
Production wants the same kind of task to still be done well next week.
The Claude Code situation is a clean demonstration: a wrong harness change can make a frontier model look like it is regressing. A good harness can keep that same model performing reliably inside real workflows.
Last week I wrote "The cheaper you try to be on tokens, the more they cost", about how users who only optimise for token spend often end up burning more money. This week Anthropic showed the reverse: an AI company that only optimises for faster, cheaper, shorter can quietly sacrifice the quality the user actually needs.
These are two faces of the same coin.
In the AI product world, "saving" is not a neutral verb.
It is always trading against reasoning depth, context, or stability.
Closing note
I actually think Anthropic publicly unpacking this is a positive signal.
A regression is a regression. Claude Code is something a lot of engineers depend on every day, and that kind of degradation directly damages trust.
But this post-mortem has value because it took a problem that many people only vaguely felt and turned it into a system problem that can be discussed, tested, and fixed.
In the early days, people watched model scores. Then people watched whether agents could complete tasks. Going forward, people will watch whether the entire AI work system is reliable, observable, and rollback-able.
As a product person and engineering leader, I am increasingly convinced: the point of adopting AI is not buying the strongest model.
It is designing a system that lets a strong model perform reliably.
Models will keep getting stronger, but what really decides production-grade quality will be the boring stuff: defaults, prompts, context, cache, eval, rollback.
The same fundamentals we already learned in software engineering, now to be re-applied to AI agents.
The model is not the product. The system is.
PS
The post-mortem itself is unusually honest, and worth a complete read for anyone building AI products. I suspect a lot of engineers' first reaction will be: so it was not me getting worse, it really did get worse. Huh.