GPT-5.5 Hallucinates 86% of the Time. Here's How to Use It Anyway.

GPT-5.5 leads every benchmark — except one. On AA-Omniscience, it's 57% right and 86% confidently wrong. Here's when to use it, and when not to.

OpenAI shipped GPT-5.5 on Wednesday and every tech outlet led with the same number: 84.9% on GDPval, the benchmark that measures AI performance on real workplace tasks across 44 professions. Highest score ever. It also tops the Artificial Analysis Intelligence Index, beats Claude Opus 4.7 on reasoning, and does it with about 40% fewer output tokens than GPT-5.4.

And then, buried in the Artificial Analysis evaluation page that came out the same afternoon, was the number nobody’s quoting: 57% accuracy on AA-Omniscience, with an 86% hallucination rate.

For context, Claude Opus 4.7 hallucinates at 36%. Gemini 3.1 Pro hallucinates at 50%. GPT-5.5 hallucinates at 86%. On the benchmark specifically designed to measure “how often does this model confidently tell me something wrong,” GPT-5.5 is the worst offender of any flagship model.

Both things are true. It’s the smartest model you can rent by the token, and it’s the most willing to make stuff up. Understanding the gap between those two facts is the difference between using GPT-5.5 well and blowing up a client report on Monday morning.

What AA-Omniscience actually measures

Artificial Analysis is one of the independent benchmarking outfits that tracks AI models — the same folks who publish the Intelligence Index that OpenAI quoted in its own launch post. They built AA-Omniscience to stress-test factual knowledge across 40-plus domains and, crucially, to penalize confident wrong answers more than admitting “I don’t know.”

The scoring has two numbers for every model:

  • Accuracy — when the model answers a factual question, how often is it right?
  • Hallucination rate — when the model doesn’t know something, how often does it make up a confident answer instead of refusing?

A model that says “I’m not sure” a lot might score low on accuracy but also low on hallucination. A model that always answers, even when it’s guessing, can score high on accuracy if its guesses often land — and it racks up a high hallucination rate every time a guess misses.
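Under that scheme, both numbers fall out of three counts: correct answers, confident wrong answers, and refusals. The sketch below is one reading of the two axes, not Artificial Analysis’s published formula, but it makes the trade-off concrete:

```python
def omniscience_metrics(correct: int, wrong: int, refused: int) -> tuple[float, float]:
    """Accuracy over all questions; hallucination rate over the questions
    the model did NOT get right (wrong + refused).

    Illustrative reading of the benchmark's two axes, not Artificial
    Analysis's exact scoring formula.
    """
    total = correct + wrong + refused
    accuracy = correct / total
    # Of the questions it couldn't answer correctly, how often did it
    # bluff (answer confidently wrong) instead of refusing?
    unknown = wrong + refused
    hallucination_rate = wrong / unknown if unknown else 0.0
    return accuracy, hallucination_rate

# A model that commits to almost every answer: high accuracy AND a
# high hallucination rate, matching GPT-5.5's 57% / 86% split.
acc, hall = omniscience_metrics(correct=57, wrong=37, refused=6)
```

Note that the two axes have different denominators — that’s how a model can lead on accuracy and still be the worst hallucinator.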

Here’s the leaderboard on the hallucination axis, pulled from Artificial Analysis on April 24:

  • GPT-5.5 (xhigh): 57% accuracy (highest ever), 86% hallucination rate
  • Claude Opus 4.7 (max): lower accuracy, 36% hallucination rate
  • Gemini 3.1 Pro Preview: lower accuracy, 50% hallucination rate

GPT-5.5 answers more questions correctly than any model Artificial Analysis has tested. It also answers confidently wrong more often than any flagship. Artificial Analysis’s own language: “GPT-5.5 is better at the right answer when it knows it, but also more willing to confabulate when it doesn’t.”

That last word is the one to hold onto. Confabulation isn’t a small mistake. It’s a specific failure mode: the model invents details — names, numbers, citations, dates, regulations — that sound plausible in context, and it delivers them in the same tone of voice it uses when it’s right. There’s no signal in the output that says “I’m guessing.” That’s what the 86% number captures.

The benchmark it wins, and the one it’s confidently wrong on

GPT-5.5 is legitimately a step forward on most things. The numbers OpenAI led with are real.

Where GPT-5.5 genuinely leads:

  • GDPval (knowledge work across 44 occupations): 84.9%, highest ever
  • Terminal-Bench 2.0 (command-line task planning): 82.7%, beats Claude Opus 4.7 at 69.4%
  • OSWorld-Verified (autonomous computer use): 78.7%, edges Claude at 78%
  • ARC-AGI-2 (reasoning puzzles designed to resist memorization): 85%, beats Claude at 75.8% and Gemini at 77.1%
  • FrontierMath Tier 4 (graduate-level advanced math): 35.4% standard, 39.6% in Pro mode, roughly 1.7x Claude’s 22.9%
  • Long-context MRCR v2 (512K–1M tokens): 74%, roughly double GPT-5.4’s 36.6%

Those are the headline numbers. They say: if you need long-horizon agentic coding, complex planning, hard math, or full 1M-token context, GPT-5.5 is now the top choice.

Where the 86% bites:

  • SWE-Bench Pro (real software engineering tasks from GitHub issues): GPT-5.5 scores 58.6%. Claude Opus 4.7 scores 64.3%. Claude is still better at “fix this bug in a real codebase.” OpenAI’s explanation is that Anthropic may have benefited from training-data overlap — a claim that’s impossible to verify from the outside. Treat it with skepticism until someone runs an independent eval.
  • Any workflow that asks the model a factual question and takes the answer at face value. This is where the 86% shows up. Citation generation, legal or medical claims, regulatory references, specific historical dates, exact phrasing of laws — the kinds of things that feel like they should be correct because the model sounds certain.

Notice what the second category has in common. It’s not code. It’s not math. It’s the soft stuff professionals rely on daily: “What does GDPR article 17 require?” “Cite three cases supporting this argument.” “What’s the FDA labeling requirement for a class II device?” GPT-5.5 will answer every one of those confidently. It will sometimes be wrong. You won’t be able to tell from the answer.

Why 86% happens

The short version: OpenAI trained GPT-5.5 hard on being useful. Useful models answer. Models that refuse questions feel worse in user testing even when their refusals are correct.

The longer version, pieced together from OpenAI’s system card and the Artificial Analysis write-up: the training signal rewards coherent, confident, on-topic output. When the model runs out of actual knowledge, the next-best thing it can produce is still coherent and confident — it just isn’t grounded. Previous models often hedged with “I’m not sure, but…” GPT-5.5 is trained to commit. That commitment helps on tasks where commitment is the right move (planning, reasoning, code) and hurts on tasks where recall is the right move (facts, citations, rules).

The numbers make the mechanism plain: the 14-point accuracy gain over GPT-5.4 came almost entirely from knowing more, not from hallucinating less. That’s the trade.

How to use GPT-5.5 without getting burned

Three workflows. Use them when the task you’re about to run involves facts, numbers, or claims somebody downstream will act on.

Workflow 1: The source-check pass

For any output that contains load-bearing facts — citations, statistics, dates, names, quoted regulations — run it through a second prompt before using it:

“Here’s the response you just generated. For every specific claim with a date, number, name, or quoted text, tell me: (1) the claim, (2) a source you can point to, and (3) your confidence that the source says exactly what you claimed. If you can’t name a source, say so explicitly.”

GPT-5.5 is much more willing to flag uncertainty when you ask it to grade its own output than when you ask it to produce the output the first time. The difference is striking. The second pass often catches 60–80% of the hallucinations the first pass generated. Not all of them. Most of them.

This doesn’t replace verification. It narrows the verification surface. If the second pass flags 4 low-confidence claims out of 20, you now know which 4 to actually check.
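If you run this pass at scale, the flagging step is scriptable. A minimal sketch, assuming you’ve prompted the model to emit one claim per line in a loose `claim | source | confidence: X` format — that format is a prompt-design choice on your side, not something the model produces unprompted:

```python
import re

def flag_low_confidence(verification_text: str, threshold: float = 0.8) -> list[str]:
    """Pull out claims the model rated below `threshold` confidence,
    or for which it named no source.

    Assumes each line looks roughly like:
        claim | source | confidence: 0.95
    with the literal phrase "no source" when the model can't name one.
    """
    flagged = []
    for line in verification_text.strip().splitlines():
        if not line.strip():
            continue
        m = re.search(r"confidence:\s*([01](?:\.\d+)?)", line, re.IGNORECASE)
        no_source = "no source" in line.lower()
        low = m is not None and float(m.group(1)) < threshold
        if no_source or low:
            flagged.append(line.strip())
    return flagged

# Hypothetical second-pass output for a three-claim draft.
report = """\
GDPR Art. 17 grants erasure rights | eur-lex.europa.eu | confidence: 0.95
Fine cap is 2% of global revenue | no source | confidence: 0.4
Ruling issued March 2023 | court records | confidence: 0.6
"""
to_check = flag_low_confidence(report)  # the claims worth verifying by hand
```

In practice you’d tighten the prompt (or use structured output) to guarantee the line format before parsing it.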

Workflow 2: The “is this true?” partner model

For high-stakes outputs, run the same question through a second model with a different training distribution. Claude Opus 4.7’s 36% hallucination rate isn’t great, but it’s less than half of GPT-5.5’s. When both models agree on a factual claim, you’re probably fine. When they disagree, you’ve caught something.

The cost math works out: GPT-5.5 medium matches Claude Opus 4.7 max quality on agentic tasks at about one-quarter of the token cost. Use GPT-5.5 for the first draft. Use Claude Opus 4.7 for the fact-check pass on the same content. The combined cost is usually lower than running everything on Claude, and you get the hallucination catch for free.
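The agreement check itself can be mechanical. A sketch with hypothetical question/answer pairs — a real pipeline would compare claims with fuzzy matching or a third model as judge, since two correct answers can be worded differently:

```python
def disagreements(answers_a: dict[str, str], answers_b: dict[str, str]) -> list[str]:
    """Return the questions where two models gave different answers.

    Naive normalized-string comparison; good enough to surface
    candidates for human review, not to adjudicate them.
    """
    def norm(s: str) -> str:
        return " ".join(s.lower().split()).rstrip(".")

    return [q for q in answers_a
            if q in answers_b and norm(answers_a[q]) != norm(answers_b[q])]

# Hypothetical outputs: a GPT-5.5 draft pass and a Claude check pass.
gpt = {
    "GDPR erasure article": "Article 17",
    "HIPAA enacted": "1996",
    "GDPR fine cap (severe violations)": "2% of global turnover",  # confidently wrong
}
claude = {
    "GDPR erasure article": "Article 17",
    "HIPAA enacted": "1996",
    "GDPR fine cap (severe violations)": "4% of global turnover",
}
needs_review = disagreements(gpt, claude)  # only the fine-cap claim
```

Where both models agree, you move on; where they diverge, a human checks the primary source.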

Workflow 3: The “show your work” prompt for code

On coding tasks — where GPT-5.5 is legitimately strong — the hallucination shows up as imagined libraries, imagined function signatures, and imagined API endpoints. For anything you’re about to actually run, make the model show its work:

“For every external API, library, or function you used in this code, list: (1) the exact name, (2) the import or version used, (3) a one-line note on what it does. If you used something you’re not 100% sure exists, mark it MAYBE.”

Then grep for MAYBE in the response. Most hallucinated libraries get flagged by this prompt because the model, when forced to enumerate, hesitates on the ones it fabricated.
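The grep step is trivially scriptable — slightly more robust than a literal `grep MAYBE`, and assuming the one-dependency-per-line format the prompt above asks for:

```python
import re

def maybe_flags(enumeration: str) -> list[str]:
    """Extract dependency names the model marked MAYBE in its own
    enumeration. Assumes one dependency per line, name first, with the
    literal token MAYBE somewhere on uncertain lines — the format the
    prompt asks for, not anything the model enforces on its own.
    """
    hits = []
    for line in enumeration.splitlines():
        if re.search(r"\bMAYBE\b", line):
            # First token on the line is the dependency name, per the prompt.
            hits.append(line.split()[0].strip("(),"))
    return hits

# Hypothetical model output enumerating a generated script's dependencies.
deps = """\
requests (2.x) - HTTP client for the API calls
fastjsonlib (1.0) - JSON parsing MAYBE
pathlib (stdlib) - filesystem paths
"""
suspect = maybe_flags(deps)  # check these exist before running the code
```

Anything this surfaces gets a quick package-index lookup before you execute.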

None of these three workflows are novel. What’s new is that with GPT-5.5 specifically, they’ve shifted from “good practice” to “necessary for high-stakes work.” The 86% number moves the bar.

When to use GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro

Based on the benchmark split, here’s the decision framework:

  • Long-horizon agentic coding (20+ hours of human-equivalent work) → GPT-5.5. Terminal-Bench 2.0 leader at 82.7%; doesn’t get lost in multi-step workflows.
  • Complex reasoning, novel problem-solving, puzzles → GPT-5.5. ARC-AGI-2 at 85%; best on problems without training-data overlap.
  • Math, especially graduate-level → GPT-5.5 Pro. FrontierMath Tier 4 at 39.6%; roughly 1.7x the nearest competitor.
  • 1M-token context processing → GPT-5.5. MRCR v2 at 74%; a massive gap over the prior generation.
  • Fixing real bugs in real codebases → Claude Opus 4.7. SWE-Bench Pro winner at 64.3% vs GPT-5.5’s 58.6%.
  • Factual Q&A, citation-heavy work, regulatory content → Claude Opus 4.7. 36% hallucination rate vs GPT-5.5’s 86% on AA-Omniscience.
  • Long, careful drafting with calibrated uncertainty → Claude Opus 4.7. Hedges more naturally when it doesn’t know.
  • Multimodal / visual tasks (charts, images, PDFs) → Gemini 3.1 Pro. Leads multimodal benchmarks; its 50% hallucination rate still matters, but it’s well below GPT-5.5’s.
  • Budget-constrained tasks where quality can slip a tier → GPT-5.5 (medium). Matches Claude Opus 4.7 max quality at ~25% of the cost.

The framework isn’t “GPT-5.5 wins” or “Claude wins.” It’s: match the failure mode to the task. Coding and reasoning can survive confident-wrong answers — the tests catch it, the linter catches it, or the output obviously doesn’t work. Factual recall can’t — a hallucinated citation in a legal brief lands with the same confidence as a real one.
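If you’re routing requests programmatically, the framework collapses to a lookup table. The category labels below are editorial shorthand, and the model IDs mirror the article’s names rather than any official API identifiers:

```python
# Decision framework as a routing table. Categories and model names are
# illustrative, taken from the article's comparison, not an API taxonomy.
ROUTES = {
    "agentic_coding":   "gpt-5.5",
    "reasoning":        "gpt-5.5",
    "graduate_math":    "gpt-5.5-pro",
    "long_context":     "gpt-5.5",
    "bugfix_real_repo": "claude-opus-4.7",
    "factual_qa":       "claude-opus-4.7",
    "citations":        "claude-opus-4.7",
    "multimodal":       "gemini-3.1-pro",
    "budget_draft":     "gpt-5.5-medium",
}

def pick_model(task_category: str) -> str:
    """Default to the cautious model for unrecognized categories: an
    unknown task is more likely recall-shaped than code-shaped, and
    confident-wrong is the expensive failure mode."""
    return ROUTES.get(task_category, "claude-opus-4.7")
```

The design choice worth noting is the default: when you can’t classify the task, fall back to the model with the lower hallucination rate, not the one with the higher benchmark scores.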

What it costs

API pricing is where the launch story got sharper. Standard GPT-5.5 is $5/$30 per million input/output tokens — exactly double GPT-5.4’s $2.50/$15. Pro is $30/$180 per million, a six-times step up from standard, aimed at the heaviest-duty tasks where you genuinely need the extra accuracy.

OpenAI’s counter is that GPT-5.5 uses about 40% fewer output tokens to complete the same task, because it plans more efficiently. That only softens the blow on output-token-heavy usage — a 40% cut against a 2x price still nets out more expensive per task. For input-heavy workloads (long-context document analysis, for instance), the price increase passes through nearly in full.
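The per-task arithmetic is easy to check with the listed rates. Token counts below are illustrative, not from any benchmark:

```python
def cost_usd(input_toks: int, output_toks: int, in_rate: float, out_rate: float) -> float:
    """Per-request cost; rates are USD per million tokens."""
    return (input_toks * in_rate + output_toks * out_rate) / 1e6

# Output-heavy task: 2K in, 10K out on GPT-5.4 at $2.50/$15;
# GPT-5.5 at $5/$30 emits ~40% fewer output tokens.
old = cost_usd(2_000, 10_000, 2.50, 15)   # GPT-5.4
new = cost_usd(2_000, 6_000, 5, 30)       # GPT-5.5 — still ~1.2x the cost

# Input-heavy task: 500K in, 2K out — the price doubling dominates.
old_doc = cost_usd(500_000, 2_000, 2.50, 15)
new_doc = cost_usd(500_000, 1_200, 5, 30)  # ~2x the cost
```

At exactly double the per-token price, breaking even would take a 50% output reduction, not 40% — so the efficiency claim narrows the gap on output-heavy work but never closes it.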

For ChatGPT subscribers, GPT-5.5 is already live in Plus, Pro, Business, and Enterprise. Pro-tier users get GPT-5.5 Pro. There’s no separate charge for standard GPT-5.5 inside the ChatGPT app — it’s rolled in.

What this means for you

If you’re a developer: Run your next feature on GPT-5.5. The Terminal-Bench and MRCR numbers are real, and the pricing math (40% fewer output tokens) probably wins on agentic coding workloads. But for any code that makes API calls to external services, run the “show your work” pass before executing. Hallucinated library names are the #1 way the 86% number bites developers.

If you’re using AI for research, citations, or compliance work: Don’t switch to GPT-5.5 for this. Stay on Claude Opus 4.7, or use GPT-5.5 only as a drafting tool with Claude doing the fact-check pass after. A hallucinated citation in a legal filing, a wrong regulatory reference in a policy doc, or a made-up statistic in an exec brief all land with the same weight as a real one — and GPT-5.5 is now the model most likely to produce them.

If you’re evaluating AI tools for your team: The honest recommendation is a two-model workflow. GPT-5.5 for drafting, planning, reasoning, code. Claude Opus 4.7 for anything with a factual claim that someone downstream will act on. The combined cost is often lower than Claude-only, and you get meaningful hallucination protection.

If you’ve never used GPT-5.5 because “it’s just another model”: Try it on a hard reasoning or planning task. The gap between GPT-5.4 and GPT-5.5 on things like ARC-AGI-2 (72% → 85%) and long-context MRCR (37% → 74%) is unusually large. For those specific use cases, it’s a genuine leap. Just know what it’s bad at before you aim it there.

Put simply: GPT-5.5 isn’t “the best model.” It’s the best model for a specific set of tasks — coding, agentic planning, reasoning, long-context work — and the worst of the flagships for a different specific set — factual recall, citations, anything where wrong-but-confident is the failure mode. Treat it like a tool, not an oracle. The 86% hallucination rate isn’t a bug OpenAI will patch in a week. It’s a training-time choice that shaped the entire model. Use it accordingly.

Who should use it

  • Developers and engineers running agentic or multi-step coding workflows
  • Researchers and analysts working with novel problems where reasoning matters more than recall
  • Anyone doing long-context work above 512K tokens
  • Teams already paying for Claude Opus 4.7 who want to cut costs on parts of their pipeline that don’t need max-tier accuracy

Skip GPT-5.5 as your primary tool if: your workflow is factual-claims-heavy (legal research, medical reference, compliance), you can’t afford to run a fact-check pass, or you’re already getting good results from Claude Opus 4.7 on the specific tasks you care about.

The bottom line

Every flagship model has a shape. GPT-4’s shape was “answers that feel useful.” Claude 3’s shape was “careful reasoning with calibrated uncertainty.” Gemini 1.5’s shape was “huge context window.” GPT-5.5’s shape is “commit to the answer, even when you shouldn’t.” On tasks where commitment is the right instinct, it’s the strongest tool available. On tasks where caution is the right instinct, it’s the riskiest.

The 86% number isn’t a scandal. It’s a design choice. Artificial Analysis published it on the same day as the launch. OpenAI didn’t hide it — they just didn’t lead with it. Both things are fair. What’s not fair is the reader who upgrades to GPT-5.5 assuming they just got a better Claude. What they got is a different tool with different failure modes.

Use it for what it’s great at. Verify when it matters. And if you need citations to land correctly, keep Claude Opus 4.7 around for a little while longer.

