OpenAI shipped GPT-5.5 Instant as the new default ChatGPT model on Tuesday afternoon, the day before Anthropic opened Code with Claude SF. The launch tweet hit 8,056 likes inside the first six hours. The interesting part isn’t the marketing copy — it’s that this is the first OpenAI release that puts an explicit hallucination-reduction promise on the record for three specific verticals: law, medicine, and finance. Anthropic’s Sonnet 4.6 has been the cost-anchored alternative for those workloads for two quarters. So the question every engineering manager will get this week is the same one: which model, on which surface, for which workload, this quarter?
This is a five-dimension head-to-head written for the people doing the routing, not the people doing the marketing. No “AI showdown” framing. The decision matrix at the bottom is what your Slack channel will end up using.
What actually changed Tuesday
GPT-5.5 Instant is OpenAI’s new default ChatGPT model, replacing GPT-5.3 Instant. The number that matters: in OpenAI’s internal evaluations, GPT-5.5 Instant produced 52.5% fewer hallucinated claims than GPT-5.3 Instant on high-stakes prompts covering medicine, law, and finance, plus a 37.3% reduction in inaccurate claims on conversations users had previously flagged for factual errors. Responses are also about 30% shorter, with roughly 29% fewer lines; OpenAI explicitly tuned for concise output instead of the bullet-and-emoji wall that became the GPT-5.3 Instant default. Latency stays at the previous tier: this isn’t a thinking model, it’s the fast tier with sharper retrieval and fewer hedges.
The model is available immediately to all ChatGPT users (Plus and Pro getting it on web first, mobile next, Free/Business/Enterprise after) and via the API as chat-latest. GPT-5.3 Instant remains accessible to paid users for three more months as a fallback during eval rebuilds. The memory-and-Gmail feature that landed alongside the model — “memory sources” that let the model cite past chats, files, and Gmail messages by reference — is the policy-layer story IT teams will spend the rest of the week on.
For our purposes today, we’re routing workloads. The model lives at one tier. Compare it to Claude Sonnet 4.6 across the five dimensions that actually move the routing decision.
The five-dimension routing call
Dimension 1 — Pricing
Claude Sonnet 4.6 lists at $3 per million input tokens and $15 per million output tokens. GPT-5.5 standard (the larger sibling, accessed via API for non-Instant routes) lists at $5 input / $30 output. At a like-for-like input/output mix, Sonnet 4.6 works out to roughly 1.9× cheaper per token (1.67× on input, 2× on output). Sonnet 4.6’s 200K context window is enough for most agentic workloads; GPT-5.5’s 1.1M context is a real lever only when you’re routing very long-document workloads.
For a team running two million output tokens per day across an agent fleet, that’s a $30/day delta on output alone — small per day, $11K per year before traffic growth. Pricing is the most predictable variable in this comparison and the one that compounds.
GPT-5.5 Instant specifically — the new default — is priced via chat-latest. It’s the cheaper end of the GPT-5.5 family on the API, but it’s still an Instant tier; for heavy workloads where Sonnet 4.6 has been the workhorse, the per-token math still favors Sonnet for the steady state.
Routing implication: If your workload is high-volume and you’re not bound to a specific feature, Sonnet 4.6 is the default. The cost delta only inverts when GPT-5.5’s hallucination tuning is the load-bearing reason you’re choosing it.
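The arithmetic above is worth scripting once so your team can plug in its own traffic shape. A minimal sketch: the prices are the list prices quoted in this section, the 2M-output-tokens-per-day fleet is the worked example from above, and everything else is an assumption to replace with your own numbers.

```python
# Back-of-envelope cost delta between Sonnet 4.6 and GPT-5.5 standard.
# Prices are the list prices quoted above, in dollars per million tokens.

SONNET_46 = (3.0, 15.0)   # ($ per M input tokens, $ per M output tokens)
GPT_55 = (5.0, 30.0)

def token_cost(input_mtok: float, output_mtok: float,
               prices: tuple[float, float]) -> float:
    """Dollar cost for a traffic slice, volumes in millions of tokens."""
    in_price, out_price = prices
    return input_mtok * in_price + output_mtok * out_price

# Worked example from the text: an agent fleet emitting 2M output tokens/day.
daily_delta = token_cost(0, 2, GPT_55) - token_cost(0, 2, SONNET_46)
annual_delta = daily_delta * 365
print(f"${daily_delta:.0f}/day, ${annual_delta:,.0f}/year before traffic growth")
# → $30/day, $10,950/year before traffic growth
```

Swap in your real input/output split before drawing conclusions; input-heavy workloads narrow the delta, output-heavy ones widen it.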
Dimension 2 — Hallucination tuning by vertical
OpenAI named law, medicine, and finance explicitly. The 52.5% reduction figure is internal-evaluation data, not third-party reproduced, but the targeting is real. The operator-facing read of OpenAI’s GPT-5.5 launch coverage is that “hallucination reduction in law, medicine, and finance is the part operators should test.” That’s the right framing.
Sonnet 4.6’s profile is broadly stable across verticals. Anthropic’s training approach (Constitutional AI, the Cowork enterprise context-share patterns) has not been verticalized in the same explicit way; Sonnet 4.6 is more uniform but doesn’t have an OpenAI-style “we tuned for these three verticals” claim.
Routing implication: If your traffic is dominated by high-stakes legal-research, healthcare-clinical, or finance-citation workflows, GPT-5.5 Instant has the explicit tuning advantage out of the gate. Test before committing: take the last 50 prompts in your most error-prone vertical workflow, run both models, score them yourself. If GPT-5.5 wins by more than ~5 percentage points on factual accuracy, the routing is worth the price delta. If it doesn’t, the price math wins.
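Once the expert judgments are in, the 50-prompt bake-off reduces to a small scoring step. A sketch under assumptions: `judgments` are expert pass/fail calls on factual accuracy, one per prompt, and the model names are this post’s labels, not vendor API identifiers. Collecting the model outputs themselves is out of scope here.

```python
# Scoring step for the 50-prompt vertical bake-off described above.
# Applies the ~5-percentage-point rule: route to GPT-5.5 only if its
# factual-accuracy win exceeds the threshold; otherwise price math wins.

def accuracy(judgments: list[bool]) -> float:
    """Fraction of prompts judged factually accurate by a domain expert."""
    return sum(judgments) / len(judgments)

def routing_call(gpt_judgments: list[bool],
                 sonnet_judgments: list[bool],
                 threshold_pp: float = 5.0) -> str:
    """Pick the model for this vertical workload, in percentage points."""
    delta_pp = (accuracy(gpt_judgments) - accuracy(sonnet_judgments)) * 100
    return "gpt-5.5-instant" if delta_pp > threshold_pp else "sonnet-4.6"

# 45/50 vs 38/50 accurate: a 14-point win, worth the price delta.
print(routing_call([True] * 45 + [False] * 5, [True] * 38 + [False] * 12))
# → gpt-5.5-instant
```

A 2-point win on the same 50 prompts would fall back to Sonnet 4.6, which is the point of the threshold: small deltas don’t justify the per-token premium.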
For verticals not on OpenAI’s named-three list (engineering, customer support, internal docs, sales ops), the hallucination-tuning argument doesn’t apply and the call falls back to dimensions 1 and 5.
Dimension 3 — Coding benchmarks
GPT-5.5 standard leads on SWE-bench Verified by 9.1 points (88.7 vs Sonnet 4.6 at 79.6). On Terminal-Bench 2.0, GPT-5.5 also leads. Sonnet 4.6 outperforms GPT-5.5 only on the Finance Agent benchmark — interesting given OpenAI’s hallucination-targeting in finance, suggesting the tuning helped accuracy but not full agentic execution on those tasks.
The X reactions overnight are mixed in a useful way for routing. @franklinto: “GPT 5.5 is better than Sonnet 4.6 at debugging.” @giordanorandone: “Codex was already doing a better job than Opus-4.7 in coding.” But @smithstephen rated Claude Opus 4.7 at 9/10 vs GPT-5.5 at 3-4/10 for “polished presentation,” and noted he prefers GPT-5.5 specifically inside Codex (the agentic coding workflow). @gabriel_horwitz captured the output-style complaint: GPT responses are “short lines, bullets, emojis… super long but it’s a scroll… less professional” vs Claude’s paragraph-form prose.
Routing implication: For agentic-coding workflows where Codex is your harness and the input is an issue or test failure, GPT-5.5 is the stronger pick on benchmarks and matches the harness OpenAI optimized for. For codegen feeding into reviewable, paragraph-form output (technical docs, code review explanations, architecture-decision records), Sonnet 4.6’s prose is the better default. Routing by harness, not by model, is the actually-useful framing.
Dimension 4 — Context window and document workloads
GPT-5.5: 1.1M tokens. Sonnet 4.6: 200K tokens. Five-and-a-half times the window for GPT-5.5.
For most production workloads — chat sessions, agent loops, code edits with file-scope context — 200K is more than enough. The 1.1M lever matters specifically for:
- bulk-document review where you’re feeding entire SEC filings, contract bundles, deposition transcripts, or full codebases (>200K tokens) into a single call;
- long-running multi-turn agent transcripts that exceed Sonnet’s window mid-session and force chunking.
The cost math also flips for long-document workloads: at 1.1M tokens of input on GPT-5.5, the input bill alone is $5.50 per call. Sonnet at 200K input is $0.60 per call. If you can chunk into Sonnet, you should — the orchestration overhead is almost always cheaper than the GPT pricing premium.
Routing implication: Default to Sonnet 4.6. Switch to GPT-5.5 specifically when chunking is impossible or breaks document semantics — long-form legal analysis, full-codebase refactor planning, end-to-end research syntheses where the context-share matters. Don’t pay for window you don’t use.
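The chunk-versus-window call above is easy to put numbers on. A sketch modeling input-side cost only; the 10% overlap for chunk stitching is an assumption, and real chunking adds orchestration work this ignores.

```python
# Input-cost comparison: chunked Sonnet 4.6 calls vs one long-context
# GPT-5.5 call, using the list prices quoted earlier ($ per M input tokens).

SONNET_WINDOW = 200_000  # tokens

def chunked_sonnet_input_cost(doc_tokens: int, in_price: float = 3.0,
                              overlap: float = 0.10) -> float:
    """Input cost of chunking a long document into Sonnet 4.6 calls.
    Overlap models the re-sent context that stitches chunks together."""
    effective_tokens = doc_tokens * (1 + overlap)
    return effective_tokens / 1e6 * in_price

def single_gpt_input_cost(doc_tokens: int, in_price: float = 5.0) -> float:
    """Input cost of one long-context GPT-5.5 call."""
    return doc_tokens / 1e6 * in_price

# A 1.1M-token filing: chunked Sonnet still undercuts the single GPT call.
doc = 1_100_000
print(f"${chunked_sonnet_input_cost(doc):.2f} vs ${single_gpt_input_cost(doc):.2f}")
# → $3.63 vs $5.50
```

The gap closes as overlap grows, so if your chunking strategy re-sends a lot of shared context, rerun this with your measured overlap before deciding.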
Dimension 5 — Output style and IT-policy fit
The X community split on Tuesday afternoon was almost entirely along output-style lines, not capability lines. GPT-5.5 Instant is now noticeably more concise but keeps the bullet-list-and-emoji formatting that some teams like and others actively dislike. Sonnet 4.6’s prose is paragraph-form and reads as more “polished” to enterprise audiences (per @smithstephen and a half-dozen similar posts).
The memory-and-Gmail feature is the IT-policy story. ChatGPT can now reference past chats, files, and Gmail when the user enables it, and surface “memory sources” that cite which past content informed the answer. For Plus/Pro users on company devices, that opens four IT-admin questions this week:
- Are personal Gmail accounts in scope when employees enable the feature on a company device?
- What’s our DLP policy for the memory writes?
- What’s our SCIM/identity-provider story for org-controlled accounts?
- What’s the user-comms timeline before staff turn it on?
The default-on rollout means most IT teams will need a block-by-default-or-allow-with-policy decision by end of week. Anthropic’s analogue is the M365 cross-app context-share that landed April 30 — different product shape, similar policy decision.
Routing implication: Output-style preferences are real and they’re stable per audience. If your output is being read by enterprise customers on a screen, the prose-form Sonnet bias is durable. If your output is being parsed by another agent or by an internal dev, the GPT formatting is fine. For the memory feature, the policy decision is independent of the routing decision — you can run GPT-5.5 Instant for some workloads and have memory disabled in your tenant.
The Q3 routing matrix
Strip the comparison down. Five workload archetypes, the default model for each, and the actual reason.
| Workload | Default | Why |
|---|---|---|
| High-volume agentic loops (general) | Sonnet 4.6 | 1.9× pricing edge dominates at volume |
| Legal research / healthcare clinical / finance citation | GPT-5.5 Instant | Verticalized hallucination tuning; test on your last 50 prompts |
| Codegen inside Codex harness | GPT-5.5 standard | SWE-bench leadership + harness alignment |
| Codegen for reviewable, paragraph-form output | Sonnet 4.6 | Prose default reads as polished |
| Bulk-document review (>200K tokens, no chunking) | GPT-5.5 standard | Only practical option at that context |
This matrix should outlive the launch news cycle. The two stable lanes — Sonnet 4.6 for high-volume cost and prose output, GPT-5.5 for vertical-tuned high-stakes and long-context — are durable. Code routing depends on which harness you’re already in.
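If the matrix ends up in an actual routing layer, it is one lookup. A sketch: the archetype keys and model strings are this post’s labels, not vendor API model IDs, and the fallback encodes the “Sonnet as default workhorse” rule.

```python
# The Q3 routing matrix above as a lookup. Unknown workloads fall back
# to the cost-default lane (Sonnet 4.6).

ROUTES = {
    "agent_loop_general": "sonnet-4.6",
    "legal_medical_finance_citation": "gpt-5.5-instant",
    "codegen_codex_harness": "gpt-5.5-standard",
    "codegen_prose_output": "sonnet-4.6",
    "bulk_document_over_200k": "gpt-5.5-standard",
}

def route(workload: str) -> str:
    """Map a workload archetype to its default model."""
    return ROUTES.get(workload, "sonnet-4.6")
```

Keeping the matrix as data rather than branching logic makes the quarterly re-eval a one-line diff when a new model (or a Sonnet 4.8) lands.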
What the comparison can’t tell you
A few honest limits, because this is the post you’ll get pushed back on.
OpenAI’s hallucination-reduction numbers are internal-evaluation. 52.5% sounds dramatic; you should reproduce it on your own prompts before staking a routing decision on it. The right reproduction is your last 50 prompts in the highest-stakes workflow, scored by a domain expert (not a model). That’s a half-day of work and it’s the only data that resolves the routing question for your stack.
Sonnet 4.8 is expected. Anthropic didn’t ship it at Code with Claude SF on Wednesday morning, but the npm-leaked references and the Code with Claude London (May 19) and Tokyo (June 10) timelines make a Sonnet 4.8 release plausible inside the next six weeks. If your routing decision today is borderline on the price side, pin Sonnet 4.6 and re-run the eval the day Sonnet 4.8 launches. We covered the SF launch shape in our same-day Code with Claude recap.
The output-style split is real but not stable. OpenAI is iterating on tone and formatting actively; the Tuesday “less yappy” tuning is itself a response to GPT-5.3 community feedback. Don’t make a 12-month routing call on a one-week output style. The right cadence is a quarterly re-eval of style fit against your actual output destinations.
The memory-and-Gmail feature is on by default for most users; that’s a policy story, not a routing story. Treat it as an org-wide DLP and identity decision separate from your model-routing decision. Don’t conflate the two.
GPT-5.5 standard pricing math gets worse fast for high-traffic workloads. The per-token delta compounds: a team pushing 100M output tokens per month that switches its full agent fleet from Sonnet 4.6 to GPT-5.5 standard adds roughly $1,500 per month on output alone, before traffic growth. The “just standardize on one vendor” simplification is real but it’s not free; model your actual traffic before consolidating.
The bottom line
Tuesday’s launch is real, the hallucination targeting is meaningfully different from anything OpenAI has shipped before, and for the three named verticals (law, medicine, finance) GPT-5.5 Instant is now a credible, explicitly tuned alternative to Sonnet 4.6 that deserves a head-to-head test. For everything else — high-volume agent loops, prose output, codegen outside Codex, document review — Sonnet 4.6’s pricing edge holds.
The actual move this quarter isn’t a vendor consolidation. It’s a workload-by-workload routing decision: keep Sonnet 4.6 as the default workhorse, reserve GPT-5.5 Instant for the vertical-tuned high-stakes work, and reserve GPT-5.5 standard for the long-context bulk-document work where Sonnet’s window can’t hold the input. Two lanes, one consciously hybrid stack.
If you’re an engineer who needs to actually evaluate this on your own prompts — the only data that resolves the routing call — our Evaluating AI Models course walks through the 50-prompt eval pattern, scoring rubrics for vertical accuracy, and the cost-impact spreadsheet at typical traffic shapes. It’s the playbook the rest of this post depends on.
Sources
- Introducing GPT-5.5 Instant — OpenAI
- OpenAI releases GPT-5.5 Instant — TechCrunch
- GPT-5.5 Instant rolls out to ChatGPT — Axios
- GPT-5.5 Instant smarter and concise — SiliconANGLE
- GPT-5.5 vs Claude Sonnet 4.6 model comparison — Artificial Analysis
- Claude Sonnet 4.6 vs GPT-5.5 detailed comparison — LLM Stats
- Claude Sonnet 4.6 vs GPT-5: 2026 Developer Benchmark — Sitepoint
- GPT-5.5 Instant nixes gratuitous emojis — 9to5Mac
- GPT-5.5 Instant launches as ChatGPT default — Testing Catalog