For about six weeks before Anthropic admitted anything, working Claude Code users were quietly typing the same Slack message: “is anyone else feeling like Claude is dumber lately?”
The public version of the same complaint was loud, specific, and well-engaged. On March 20, 2026, @catalinmpit posted:
“Lately, Claude makes some shocking mistakes. Implements overly complex code, Ignores the codebase’s code style, Removes working code for no reason, Replaces code that’s out of scope from the task at hand. It feels like it needs 100% supervision.”
— ~644 likes, ~78K views. On April 7, on r/ClaudeCode, u/marcin_dev opened a thread that drew hundreds of confirming replies: “Is it just me, or has Claude Code become significantly dumber over the past few days?” And by April 13, @safetyth1rd (688 likes, 17K views) was posting flatly: “It’s taking 2-3x longer to do stuff with Claude code than a month or two ago and It does a worse job. The rumors are true they’re throttling this thing.”
Most of us got a thumbs-up reaction on the Slack version and let it go. We blamed our prompts. We blamed our codebases. A few of us blamed ourselves. The dashboard didn’t show anything weird. The benchmarks Anthropic had published months earlier still said the model was great. Maybe we were imagining it.
We were not imagining it.
On April 23, 2026, Anthropic published a Claude Code postmortem confirming that three product changes had degraded Claude Code quality over the prior six weeks, and announcing a usage-limit reset for all subscribers. For everyone who’d been quietly second-guessing their own pattern recognition, it was vindication and gut-punch in roughly equal measure.
This post is for the working dev who saw the postmortem land in their feed, didn’t have time to read all of it, and needs the decoded version. We’ll cover what Anthropic actually said, why you were right to trust your gut, the cognitive bias trap that delayed the disclosure for six weeks, and — most importantly — the reliability audit framework that makes “the AI got dumber” a measurable claim instead of a vibe.
What the Postmortem Actually Said
Anthropic named three specific changes, with dates, and copped to each:
March 4, 2026 — default reasoning effort dropped from “high” to “medium” to cut latency. Anthropic called it “the wrong tradeoff” in retrospect. The visible effect: Claude Code stopped doing the careful step-back-and-think work it used to do on hard refactors. It looked impatient.
March 26 — a bug discarded reasoning history mid-session. This is the one that hit hardest for power users: across long agentic loops, Claude Code would forget what it had just figured out, redo work, and burn through usage limits faster. If you were thinking “why does it feel like Claude Code has the memory of a goldfish since late March?” — that was an actual bug, not your imagination.
April 16 — a system-prompt instruction capped responses at 25 words between tool calls. Anthropic said this “measurably hurt coding quality” before being reverted four days later. This is the change that made the model feel terse, mechanical, and unable to explain its choices.
All three were resolved by April 20. Three days later, the postmortem dropped, and every Claude Code subscriber got a usage-limit reset. Anthropic separately denied to Fortune that the slowdown was intentional — which it almost certainly wasn’t, but the slow disclosure cycle is the real problem regardless of intent.
When the postmortem hit r/ClaudeCode, the top comment from u/Sufficient-Farmer243 (744 upvotes, 500+ replies) crystallized what most of the community was thinking:
“so basically every single issue they gaslit us for weeks ended up being exactly what we thought it was. I think the community needs to collectively give themselves a pat on the back lol.”
@meetTARSai on X put the personal version more directly: “Claude Code was broken for a month. Three bugs — forgetful, repetitive, worse at coding. I experienced all three… In March I started losing context mid-task. Repeating work I’d already done. Getting worse at coding.” And in the same thread, u/Enthu-Cutlet-1337 connected a specific bug to a specific lived experience: “The 25 word cap between tool calls explains so much, I’ve been seeing opus truncate mid-reasoning on refactors for weeks and thought my prompts were off.” That last sentence — “thought my prompts were off” — is the cognitive trap this whole story turns on.
The honest read: this was a transparent disclosure handled in roughly the way you’d want a vendor to handle silent regression. The problem isn’t the postmortem itself. The problem is the six-week gap between “this is happening” and “we are confirming this is happening” — during which paying customers were left to argue with their own pattern recognition.
Why Your Gut Was Right (and the Trap That Made Everyone Doubt It)
Here’s the part most coverage of the postmortem skips: there’s a well-documented academic precedent for exactly this scenario.
In July 2023, Stanford researchers Lingjiao Chen, Matei Zaharia, and James Zou published “How Is ChatGPT’s Behavior Changing Over Time?” (arXiv:2307.09009), measuring GPT-4’s task accuracy on a fixed evaluation set across multiple snapshots. They found large, statistically meaningful drops on several tasks between the March and June 2023 versions of GPT-4. The paper crystallized what users had been calling “the GPT-4 lobotomy” since spring of that year — and it set the academic precedent that silent model regression is a real failure mode of hosted LLM products, not a folk myth.
Despite that precedent existing for nearly three years, the dynamic in March–April 2026 played out almost identically:
- Working developers noticed the regression first, in scattered Slack threads and r/ClaudeCode posts
- Anthropic’s defenders (and a fair number of users themselves) attributed it to “prompt drift,” “you’re catching edge cases now,” or “you got better at noticing imperfection”
- The vendor’s public benchmarks were unchanged, which created an authority gap — “the official numbers say the model is fine, so you must be wrong”
- Six weeks went by. Tickets were filed. Workflows were quietly degraded. People doubted themselves.
The cognitive trap here has a name: the asymmetry of evidence. When a hosted model degrades, the vendor has full access to telemetry and benchmarks; the user has access only to their own qualitative experience. When those two disagree, the user almost always loses the argument — because vibes are easy to dismiss and dashboards are not.
The Stanford paper’s contribution was making it socially acceptable to say “the model got dumber” as a hypothesis worth testing instead of an unsubstantiated complaint. Anthropic’s Apr 23 postmortem extends that: the field now has its second high-profile confirmation that silent regression happens, and that even the vendor sometimes doesn’t know it’s happening until users push back hard enough to make it a measurable claim.
That’s the part most worth internalizing. You weren’t being oversensitive. You were acting as a halfway-decent test harness for a system whose own telemetry didn’t catch the bug.
The Real Question: How Do You Make This Measurable Next Time?
Vindication is satisfying. It is also approximately worth nothing the next time it happens — because the next regression is going to start with the same six-week gaslight cycle unless you build the audit habit that closes it.
Here’s the framework that does. It takes about thirty minutes of setup and ten minutes a week to maintain.
1. Pick three to five canary tasks
Choose tasks you run frequently and where the correct output is unambiguous. Good canaries:
- “Generate a TypeScript type from this 200-line JSON sample”
- “Refactor this Python function to handle the off-by-one case in the test”
- “Find the bug in this 50-line C function” (with a known bug + known fix)
- “Summarize this 5,000-line PR diff in 5 bullet points”
Bad canaries: open-ended creative writing, design suggestions, anything where “good output” is a judgment call. You want tasks you can re-run next week and know whether the answer is the same quality.
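One low-ceremony way to keep canaries organized is one prompt file per task in a dedicated directory, so the weekly scripts later in this post can loop over them. The layout and file names below are purely illustrative:

```bash
# Hypothetical layout: one prompt file per canary. The names are made up and
# only need to stay stable so runs from different weeks are comparable.
mkdir -p canaries baselines runs

cat > canaries/find-the-bug.md <<'EOF'
Find the bug in the following C function and state the corrected line.
(paste the 50-line function here)
EOF

# Repeat for the rest of the set, e.g.:
#   canaries/json-to-ts-type.md
#   canaries/refactor-off-by-one.md
#   canaries/pr-summary.md
```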
2. Record a baseline before you need it
For each canary, record on a specific date with a specific model version:
- Cost (in dollars, from the API receipt)
- Wall-clock time
- Tool-call count (how many tool calls the agent made)
- Success / fail / partial-success
- The verbatim output, saved to a file you can diff against later
This is the single most-skipped step. Engineers wait until they think the model is degrading, then try to figure out what “good” looked like — and discover they have no record. Don’t be that engineer. The point of a baseline is that it exists before the question of “did this change?” gets asked.
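Here’s a minimal sketch of what recording a baseline can look like if you drive Claude Code non-interactively. It assumes the CLI’s print mode with JSON output (`claude -p --output-format json`) and pulls cost, turn count, and duration out of that JSON; the exact field names vary by CLI version, so treat them as assumptions to verify against your own output rather than gospel:

```bash
#!/usr/bin/env bash
# record_baseline.sh <canary-name> <prompt-file>
# Sketch only: the JSON field names (total_cost_usd, num_turns, duration_ms,
# result) are assumptions -- check what your CLI version actually emits.
set -euo pipefail

CANARY_NAME="$1"                     # e.g. "find-the-bug"
PROMPT_FILE="$2"                     # e.g. canaries/find-the-bug.md
OUT_DIR="baselines/$(date +%F)"
mkdir -p "$OUT_DIR"

# Run the canary once, non-interactively, and keep the full structured result.
claude -p "$(cat "$PROMPT_FILE")" --output-format json \
  > "$OUT_DIR/$CANARY_NAME.json"

# Extract the metrics worth tracking week over week.
jq '{cost_usd: .total_cost_usd, turns: .num_turns, duration_ms: .duration_ms}' \
  "$OUT_DIR/$CANARY_NAME.json" > "$OUT_DIR/$CANARY_NAME.metrics.json"

# Save the verbatim output so next week's run has something to diff against.
jq -r '.result' "$OUT_DIR/$CANARY_NAME.json" > "$OUT_DIR/$CANARY_NAME.output.txt"
```

Run it once per canary on a known date, keep the `baselines/` directory somewhere durable, and forget about it until Friday.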
3. Re-run weekly. Diff against baseline.
A ten-minute habit: every Friday afternoon, run the canary set, record the same five metrics, diff the verbatim outputs against the baseline. If something has shifted — cost up 30%, tool calls doubled, an output that used to be right is now wrong — you have evidence, not a vibe, and you have it within a week of the regression starting instead of six weeks.
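A sketch of that Friday loop, reusing the layout and the baseline script from step 2 (same caveats about field names):

```bash
#!/usr/bin/env bash
# weekly_diff.sh -- re-run every canary and diff its output against the baseline.
# BASELINE_DIR points at whatever date you recorded the baseline on.
set -euo pipefail

BASELINE_DIR="baselines/2026-04-24"
TODAY_DIR="runs/$(date +%F)"
mkdir -p "$TODAY_DIR"

for prompt in canaries/*.md; do
  name=$(basename "$prompt" .md)

  claude -p "$(cat "$prompt")" --output-format json > "$TODAY_DIR/$name.json"
  jq -r '.result' "$TODAY_DIR/$name.json" > "$TODAY_DIR/$name.output.txt"

  # diff exits non-zero when the files differ; that's the signal, not an error.
  diff -u "$BASELINE_DIR/$name.output.txt" "$TODAY_DIR/$name.output.txt" \
    > "$TODAY_DIR/$name.diff" || echo "changed: $name"
done
```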
For binary canaries (find-the-bug tasks where there’s a single correct answer), you can automate the diff: a bash script that runs the prompt, greps for the expected line, and exits 0 or 1. Wire that into a cron job and you have ambient regression detection for the price of one weekly cron run.
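A minimal version of that script, with the expected line obviously standing in for whatever your own canary’s known fix is:

```bash
#!/usr/bin/env bash
# canary_check.sh -- binary pass/fail for a single-correct-answer canary.
# EXPECTED is illustrative; use the known-correct line from your own canary.
set -euo pipefail

EXPECTED='i <= len - 1'

output=$(claude -p "$(cat canaries/find-the-bug.md)")

if grep -qF "$EXPECTED" <<<"$output"; then
  echo "$(date +%F) PASS find-the-bug" >> canary.log
else
  echo "$(date +%F) FAIL find-the-bug" >> canary.log
  exit 1
fi

# Example crontab entry (Friday 16:00):
#   0 16 * * 5 /path/to/canary_check.sh >> /var/log/canary-cron.log 2>&1
```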
4. Distinguish drift from noise
Not every diff is a regression. LLMs are stochastic. The same prompt run twice will give slightly different outputs. The discipline is running each canary three times in your weekly check and looking for consistent shifts, not single-instance variation.
A signal you actually care about looks like: “This canary completed in 12 tool calls every week for two months. Last week it completed in 19. This week, 21. The output is also missing a key step it used to include.” That’s drift: three data points, a consistent direction, and a behavioral assertion (the missing step) to back it up. That gets a Slack message to the team. A single fluctuation does not.
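If you want the noise filter to be mechanical rather than eyeballed, a small script can run one canary three times and compare the median turn count against the baseline. The three-run median and the 1.5x alert threshold below are arbitrary starting points, not vendor guidance; tune them to your own variance:

```bash
#!/usr/bin/env bash
# drift_check.sh -- three runs of one canary, alert on a sustained shift in turns.
# BASELINE_TURNS comes from your recorded baseline; JSON field names as before.
set -euo pipefail

BASELINE_TURNS=12
PROMPT="$(cat canaries/refactor-off-by-one.md)"

turns=()
for _ in 1 2 3; do
  t=$(claude -p "$PROMPT" --output-format json | jq '.num_turns')
  turns+=("$t")
done

# Median of three filters out single-run stochastic variation.
median=$(printf '%s\n' "${turns[@]}" | sort -n | sed -n '2p')

if (( median > BASELINE_TURNS * 3 / 2 )); then
  echo "DRIFT: median turns $median vs baseline $BASELINE_TURNS (runs: ${turns[*]})"
  exit 1
fi
echo "ok: median turns $median (baseline $BASELINE_TURNS)"
```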
5. Build a two-engine fallback
The single biggest mistake the Mar–Apr 2026 cohort made wasn’t trusting Anthropic — it was having no parallel pipeline to compare against. If Claude Code was your only AI coding tool, “is it getting worse?” had no comparison surface; if you’d been running canaries through both Claude Code and a second engine (Codex, Cursor, OpenClaw, the new DeepSeek V4 Anthropic-compatible endpoint), the divergence between them on canary tasks would have been a much clearer signal than absolute degradation alone.
This is the two-engine insurance pattern: keep two independently developed AI coding stacks audited against the same canaries. When one degrades and the other doesn’t, you have something much better than a vibe: a controlled comparison that localizes the problem to the engine rather than to your prompts or your codebase. And critically, you can switch primary engines in the time it takes to update environment variables, instead of doing a multi-week migration under degradation pressure.
For Claude Code specifically, the simplest second engine right now is DeepSeek V4-Pro through the Anthropic-compatible endpoint (covered in detail here). Same CLI, four env vars, and you can flip between them in 30 seconds. That’s redundancy you’d want regardless of how you feel about either vendor.
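For reference, here’s roughly what the swap looks like. The variable names follow Claude Code’s documented environment overrides, but the endpoint URL and model ids are placeholders pulled from the linked post; confirm the current values before relying on them:

```bash
# Primary engine: stock Claude Code. Clear any overrides.
unset ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN ANTHROPIC_MODEL ANTHROPIC_SMALL_FAST_MODEL

# Fallback engine: DeepSeek's Anthropic-compatible endpoint (values illustrative).
export ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic"
export ANTHROPIC_AUTH_TOKEN="$DEEPSEEK_API_KEY"        # assumes this is already set
export ANTHROPIC_MODEL="deepseek-v4-pro"                # placeholder model id
export ANTHROPIC_SMALL_FAST_MODEL="deepseek-v4"         # placeholder model id
```

Keep both blocks in a pair of source-able files or shell functions and the 30-second swap is literal.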
Why the Anthropic Specifics Matter Less Than the Pattern
If your read of this post so far is “Anthropic bad, switch to X” — that’s not the take. Anthropic ran a transparent postmortem. Most vendors don’t. The Stanford paper documented OpenAI doing the same silent-degradation thing with GPT-4 in 2023. The pattern is how hosted-LLM products work, not a specific vendor failure mode.
What changes after April 23, 2026, is that we have two well-documented confirmations of silent regression spread over almost three years across two different vendors. The reasonable working assumption is that hosted-LLM products will silently regress periodically, and that vendor benchmarks lag user-observed quality by weeks to months. The mature engineering response isn’t outrage — it’s the audit framework above, plus the two-engine fallback, run as ambient infrastructure.
The good news: this work pays off the first time you catch a regression. Even one early detection — “the assistant has gotten 40% more expensive to complete this task in the last two weeks, here are the receipts” — is the difference between a billing surprise and an informed decision. Most teams don’t have this. Building it makes you the person who does.
What’s Worth Doing This Week
If you’re a working Claude Code user reading this and wondering what concretely to do:
- Pick three canary tasks you do most weeks. Spend twenty minutes today recording a baseline (cost, time, tool calls, output) for each.
- Set a recurring Friday block (the ten-minute habit from step 3) to re-run them. Diff against baseline. File the output in a folder you can grep later.
- Configure a second engine at the env-var level — DeepSeek V4-Pro via Anthropic-compatible endpoint is the lowest-friction option right now. You don’t have to use it daily; you have to be able to swap to it in 30 seconds.
- Don’t trust your memory of “good output” three weeks from now. It’s worse than you think. Save the baseline files. Diff explicitly. Let the receipts argue for you.
If you want the full framework — observability for agentic loops, the diff-test protocol, the reliability runbook template, and the two-engine insurance pattern as a deployable artifact — the Claude Code Reliability Audit course walks through it in eight lessons with concrete recipes you can paste into your project today. It’s the systematic version of this post: same anchor in the Apr 23 postmortem, same Stanford precedent, same audit habit, but with the spreadsheet templates, the bash scripts, and the cost-quality Pareto framework worked out so you can deploy the full operating manual instead of the high-level principle.
The lesson of the postmortem isn’t “Anthropic got caught.” It’s that silent regression is a normal failure mode of every hosted LLM product, your gut is a halfway-decent detector for it, and the only durable answer is making your gut measurable. Six weeks of the Mar–Apr cohort doubting themselves is six weeks too many for the next one. Build the canaries. Run the diffs. Let the data argue with the dashboard. That’s the answer.
The next time someone in your Slack types “is anyone else feeling like Claude is dumber lately?” — be the person with the receipts.