In April 2026, something odd happened in AI.
VentureBeat, The Register, and Fortune all ran stories about Claude Opus 4.6 getting worse. Reddit, X, GitHub, Hacker News — the chorus was everywhere. One user posted an analysis of 6,852 coding sessions showing measurable regression.
Meanwhile, on the LMSYS Chatbot Arena, that same Opus 4.6 held the number-one spot overall.
The top-ranked AI was being called broken. I kept returning to this contradiction.
I use Cursor and have gone back and forth between Claude Code and OpenAI's Codex; these days I am back on Claude Code. The decline everyone talks about? I have not felt it. But the internet sees a different reality.
So I traced what actually happened.
What happened from February to April
February 5: Opus 4.6 launched with adaptive thinking enabled by default. A normal update.
The shift came weeks later.
On March 3, Anthropic lowered the default reasoning effort from "high" to "medium." Users had complained about excessive token consumption, and the change was disclosed in the changelog. On March 8, thinking content switched to summarized display.
Here is the thing. The effort change did not touch the model weights — the "brain" itself. It changed how deeply the model thinks by default. That is not "getting dumber." It is closer to "given less time to think."
But from the user's side, the distinction vanishes. Shallower output feels like decline.
April accelerated everything.
Stella Laurenzo, AI strategy director at AMD, posted GitHub issue #42796. She analyzed 6,852 Claude Code sessions and 234,760 tool calls. File reads before edits dropped from 6.6 to 2.0. Edits without reading jumped from 6.2% to 33.7%. User interventions increased by 1,167%.
This was persuasive. Data, not vibes.
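To see what numbers like these actually measure, here is a minimal sketch of how per-session metrics could be computed from tool-call logs. The JSONL schema, the field names, and the split between "read" and "edit" tools are assumptions for illustration; Laurenzo's actual pipeline is not described at this level of detail.

```python
# Sketch: "reads before each edit" and "edits without a prior read",
# computed from a hypothetical tool-call log (one JSON object per call).
import json
from collections import defaultdict

READ_TOOLS = {"Read", "Grep", "Glob"}   # assumption: which tools count as reading
EDIT_TOOLS = {"Edit", "Write"}          # assumption: which tools count as editing

def session_metrics(path="sessions.jsonl"):  # hypothetical log file
    reads_before_edit = []                   # running read count at the time of each edit
    blind_edits, total_edits = 0, 0          # edits touching files never read in-session
    files_read = defaultdict(set)            # session_id -> files read so far
    read_count = defaultdict(int)            # session_id -> read-type calls so far

    with open(path) as f:
        for line in f:
            call = json.loads(line)
            sid, tool = call["session_id"], call["tool"]
            if tool in READ_TOOLS:
                read_count[sid] += 1
                files_read[sid].add(call.get("file_path"))
            elif tool in EDIT_TOOLS:
                total_edits += 1
                reads_before_edit.append(read_count[sid])
                if call.get("file_path") not in files_read[sid]:
                    blind_edits += 1

    return {
        "avg_reads_before_edit": sum(reads_before_edit) / max(len(reads_before_edit), 1),
        "pct_edits_without_read": 100 * blind_edits / max(total_edits, 1),
    }
```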
But around the same time, another piece of "evidence" appeared. BridgeMind AI claimed their benchmark "BridgeBench" proved Opus 4.6 had been "nerfed." Scores dropped from 83.3% to 68.3%, they said.
Both presented numbers. Both said "declined."
The resemblance ends there.
BridgeBench had serious methodological problems. It compared a 6-task run against a 30-task run, so the aggregates were not measuring the same thing. On the tasks that appeared in both runs, scores were 85.4% versus 87.6%, essentially unchanged. Yet the headline claim of a "98% increase in hallucinations" spread across Reddit and X.
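The flaw is easy to reproduce with invented numbers: if one run includes tasks the other does not, the aggregates drift apart even when performance on every shared task is identical. Everything below is made up for illustration; none of it is BridgeBench's data.

```python
# Invented numbers showing a composition effect: the aggregate "drops"
# because the second run contains extra, harder tasks, not because the
# model got worse on anything the two runs share.
run_old = {f"task_{i}": 0.85 for i in range(6)}           # 6 tasks, all 85%
run_new = {f"task_{i}": 0.85 for i in range(6)}           # same 6 tasks, same scores
run_new |= {f"hard_task_{i}": 0.55 for i in range(24)}    # plus 24 harder tasks

def aggregate(run):
    return sum(run.values()) / len(run)

shared = run_old.keys() & run_new.keys()
print(f"aggregate, old run: {aggregate(run_old):.1%}")    # 85.0%
print(f"aggregate, new run: {aggregate(run_new):.1%}")    # 61.0%
print(f"shared tasks, new run: {sum(run_new[t] for t in shared) / len(shared):.1%}")  # 85.0%
```

A 24-point aggregate drop with nothing changed on the shared tasks is the same shape as BridgeBench's own overlap numbers.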
Is this one problem or four?
Tracing the timeline revealed something I had not expected.
Under the single word "declined," at least four distinct phenomena were mixed together.
First: the effort setting change. A deliberate product decision by Anthropic. Not model degradation.
Second: infrastructure bugs. Anthropic's September 2025 postmortem documented exactly this class of failure: routing errors sent 16% of requests to the wrong servers, and a compiler bug dropped the highest-probability tokens from the selection pool.
Third: context corruption. Chroma's research confirmed that LLM performance degrades measurably as input length grows. Long coding sessions accumulate this effect.
Fourth: cognitive bias. Expectation inflation, peak-end rule, community amplification. The better you know a tool, the higher your baseline. Yesterday's best output becomes today's "normal."
These four have different causes and different fixes. Effort settings can be overridden manually. Infrastructure bugs need Anthropic to patch. Context corruption can be managed through session design. Cognitive bias requires questioning your own baseline.
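For the first of those fixes, here is a minimal sketch of pinning reasoning depth yourself through the Anthropic Messages API's extended-thinking budget instead of relying on the default. The model id and token numbers are placeholders, and Claude Code manages its own effort setting separately, so treat this as one illustrative route rather than the official remedy.

```python
# Sketch: request an explicit extended-thinking budget rather than the default.
# The model id and token numbers below are placeholders, not recommendations.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",       # placeholder model id
    max_tokens=8000,
    thinking={
        "type": "enabled",
        "budget_tokens": 6000,     # explicit thinking budget, must stay below max_tokens
    },
    messages=[{"role": "user", "content": "Refactor this parser so it has no side effects."}],
)
print(response.content[-1].text)   # final block is the answer; earlier blocks hold the thinking
```

If shallower-than-expected answers are the complaint, this is the kind of lever to check before concluding the model itself got worse.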
Yet all four collapsed into one word.
When a word becomes a filter
This is where it gets interesting.
Compressing different phenomena into one word is ordinary. "That restaurant got worse" — did the ingredients change, did your palate adjust, or was it just an off night? We do not bother distinguishing.
But in AI, the compressed word was doing something else.
I went looking for why BridgeBench spread despite its broken methodology, and found a psychology paper. Ziva Kunda's 1990 study on "motivated reasoning" — cited over 21,000 times. The core finding: when people have a directional goal ("I want this to be true"), they become lenient with supporting evidence and harsh with contradicting evidence. But they cannot distort freely. They need something that looks like evidence.
BridgeBench's numbers served exactly that function. There were percentages. A benchmark name. An organization name. The appearance of science was enough. The methodology did not get examined before the conclusion was accepted.
Meanwhile, LMSYS showing Opus 4.6 at number one was dismissed: "Benchmarks don't measure real-world use." Same category of data, opposite treatment.
I think of this as the contagion of "declined."
The word does not transmit as information. It transmits as a filter. Once you hear it, you start scanning your own experience for evidence of decline. A shallow output that would have been unremarkable becomes proof. You find it. You are certain. You post about it. Someone reads your post and starts scanning too.
Once the loop is running, counterevidence stops working. "It is still number one on benchmarks" gets met with "benchmarks are not real-world." "Try setting effort to high" gets met with "that is not the point."
At this point you might be thinking: so it was all bias? The people complaining were wrong?
No. That is not it either.
Laurenzo's data was methodologically sound. Behavioral data from 6,852 sessions is not vibes. Anthropic did change the effort setting. Boris Cherny, Claude Code's lead, explained the timeline directly on GitHub.
"It is contagion, therefore fake" does not hold. Real changes happened. But the real changes and the amplified perception are tangled together in a way that is genuinely hard to separate.
That is what makes this difficult —