OpenAI shipped three new realtime voice models yesterday afternoon. GPT-Realtime-2 with GPT-5-class reasoning. GPT-Realtime-Translate doing live 70-language translation at $0.034 a minute. GPT-Realtime-Whisper streaming transcription at half a cent. The Realtime API moved to GA with MCP, image inputs, and SIP phone calling.
Anthropic, meanwhile, just spent two days on stage at Code with Claude San Francisco and did not announce a single voice product. Not a model. Not an API. Not a roadmap line. The keynote opener even said the quiet part out loud: “No new model today. Today is about how we are making our products work better for you.”
So if you’re an engineering manager picking a voice stack this quarter — or a small team that’s been waiting for Anthropic to catch up — yesterday changed the math. Here’s the honest decision frame.
What actually shipped on May 7
Three models landed in the OpenAI API at the same time. They’re related but they do different jobs.
GPT-Realtime-2 is the headline. It’s a speech-to-speech model — audio in, audio out, no separate transcription step in the middle. The new bit is GPT-5-class reasoning living inside the voice loop. Context window jumped from 32K to 128K. You can dial reasoning effort from minimal to xhigh, the same way you would with a text model. It supports preambles (“let me check that for you”) and parallel tool calls with audible status updates, so the user hears the agent thinking rather than waiting in silence. There’s also a stop-until-wake-word mode for ambient deployments. Pricing is $32 per million audio input tokens, $0.40 for cached input, and $64 per million audio output tokens.
GPT-Realtime-Translate is a single model that handles 70+ input languages and translates into 13 output languages, live, while the speaker keeps talking. $0.034 per minute. This is the model that quietly kills the four-vendor stack most international support teams have been duct-taping together.
GPT-Realtime-Whisper is a streaming speech-to-text model. $0.017 per minute. Transcription that keeps pace with the speaker.
On top of that, the Realtime API itself moved to GA. You now get remote MCP-server support inside voice sessions, image inputs (your agent can see what the user shows it), and SIP integration so you can wire it to a regular phone number.
OpenAI named three production customers live: Zillow runs voice agents for housing appointments. Priceline uses it for hotel booking. Deutsche Telekom runs multilingual customer support on it.
What Anthropic showed at Code with Claude SF — and what it didn’t
The Code with Claude SF event ran May 6-7. Here’s what Anthropic actually shipped:
- Doubled rate limits for Claude Code on Pro, Max, and Enterprise
- Multi-agent orchestration moved to public beta (the lead-agent / sub-agent pattern)
- Outcomes feature in public beta (declarative success criteria for agents)
- Dreaming in research preview (agents that learn from their own sessions)
- Code Review, Remote Agents, CI auto-fix, Security Reviews
- Claude Code Routines (higher-order prompts)
- Claude Design (visual design capabilities in Opus 4.7)
- The SpaceX Colossus 1 infrastructure partnership
What’s missing from that list is the part that matters today. No voice model. No voice API. No production voice agent story. No “we’re working on it.” Not even a teaser pointing at London on May 19.
Two days of stage time, six months after Mike Krieger told Bloomberg the consumer push was a strategic priority, and voice didn’t make the cut. That’s a real signal, not a slow news week.
The 5-question Q3 routing frame
If you’re picking a voice stack this week, these five questions decide it. Run them in order.
1. Does your agent need to pause on tool calls?
GPT-Realtime-2’s preambles are the headline UX feature, but they’re a design opinion, not a free upgrade. The model speaks “let me check that” out loud while it runs a tool, and it streams parallel tool-call status as audible updates.
For a customer-support agent looking up an order, that’s a step change — silence used to read as the agent being broken. For a clinical voice agent or a financial-disclosure agent where the user expects deliberate quiet while the system verifies, the preambles are noise you’ll spend a sprint suppressing.
If you want silence during tool calls, configure preambles off and budget two days to tune the prompt around it. If you want the audible feedback loop, you’re already on the easiest path on the market today.
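If you want that in code, here’s a minimal sketch of the session-update shape. The `session.update` event mirrors the existing Realtime API; the `preambles` and `reasoning` fields are our guesses at the knobs described in the launch, so verify against the GA schema before copying.

```typescript
import WebSocket from "ws";

// Connect and immediately reconfigure the session. `session.update` mirrors
// today's Realtime API; `preambles` and `reasoning.effort` are assumed field
// names based on the launch description, not confirmed schema.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-2", // assumed model slug
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

ws.on("open", () => {
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        preambles: "off", // assumed: suppress spoken "let me check that" fillers
        reasoning: { effort: "minimal" }, // assumed: the minimal-to-xhigh dial
        instructions: "Stay silent while tools run; speak only once results are back.",
      },
    })
  );
});
```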
2. What’s your reasoning depth ceiling?
The reasoning effort dial goes from minimal to xhigh. Minimal is ChatGPT’s old voice model — fast, shallow, cheap. xhigh is GPT-5-class reasoning with audio.
Match the dial to the call type (the sketch after this list turns the mapping into code):
- FAQ deflection, password resets, appointment scheduling: minimal
- Multi-step booking with constraints, support escalation routing: medium
- Clinical intake, legal triage, complex financial questions: high
- Anything where the latency budget allows 4+ seconds of reasoning: xhigh
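Here’s that mapping as a lookup you can drop in front of session creation. The effort values mirror the dial OpenAI named; the call-type labels and the 4-second threshold are ours.

```typescript
// Call-type-to-effort routing. Effort values mirror the launch's dial;
// call-type keys and the latency threshold are illustrative.
type ReasoningEffort = "minimal" | "medium" | "high" | "xhigh";

const EFFORT_BY_CALL_TYPE: Record<string, ReasoningEffort> = {
  faq_deflection: "minimal",
  password_reset: "minimal",
  appointment_scheduling: "minimal",
  constrained_booking: "medium",
  escalation_routing: "medium",
  clinical_intake: "high",
  legal_triage: "high",
  complex_financial: "high",
};

// Reach for xhigh only when the caller can tolerate 4+ seconds of silence.
function effortFor(callType: string, latencyBudgetMs: number): ReasoningEffort {
  const mapped = EFFORT_BY_CALL_TYPE[callType];
  if (mapped) return mapped;
  return latencyBudgetMs >= 4000 ? "xhigh" : "high";
}
```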
If your agent needs xhigh on every call, you’re paying for reasoning the user won’t wait through. The realistic production pattern is a minimal-effort voice front-end with a text reasoning model in the background, which we’ll come back to in question 4.
3. Is multilingual support a hard requirement?
This is where Translate quietly redraws the map. The standard international support stack today is Whisper for STT, DeepL or Google Translate for translation, Claude or GPT for reasoning, ElevenLabs or Cartesia for TTS. Four vendors, four contracts, four audit trails, four prompt caches, four latency budgets that stack to 800-1500ms end-to-end.
Translate compresses that into a single $0.034-per-minute API call. For a small support team running 1,000 minutes a day, that’s $34 daily. For a 5,000-call mid-market team, $510 daily. Both numbers are dramatically below what the four-vendor stack actually costs once you add integration engineering.
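The arithmetic is short enough to sanity-check in a dozen lines. The $0.034/min is the launch price; the four-vendor per-minute rates below are illustrative assumptions, not quotes, so swap in your contracted numbers.

```typescript
// Launch price for Translate, per the announcement.
const TRANSLATE_PER_MIN = 0.034;

// Assumed four-vendor per-minute rates (STT + MT + LLM + TTS).
// Illustrative only: swap in your contracted prices.
const FOUR_VENDOR_PER_MIN = 0.006 + 0.01 + 0.02 + 0.08; // ≈ $0.116/min

// Small team at 1,000 min/day; mid-market at 5,000 calls × ~3 min each.
for (const minutesPerDay of [1_000, 5_000 * 3]) {
  const consolidated = minutesPerDay * TRANSLATE_PER_MIN;
  const stacked = minutesPerDay * FOUR_VENDOR_PER_MIN;
  console.log(
    `${minutesPerDay} min/day: Translate $${consolidated.toFixed(0)} ` +
      `vs ~$${stacked.toFixed(0)} stacked, before integration engineering`
  );
}
```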
Two gates before you migrate. First, check the 13-output-language list against your support footprint — if you serve Vietnamese, Thai, or Indonesian customers, those aren’t on the output list at launch and you’ll need to verify support for your specific pairs. Second, if you have a HIPAA, EU AI Act, or India DPDP residency requirement, get OpenAI’s data-handling story documented before you cut over.
4. Are you Anthropic-locked on the rest of your stack?
This is the routing question Anthropic-anchored teams need to answer honestly. If your retrieval, your tool routing, your prompt caching, and your audit trail all run on Claude — there’s a real bridging cost to running OpenAI on the voice path while keeping Claude on text. Estimate three to five sprint-weeks for a small team to wire it cleanly: separate prompt caches, separate observability, separate tool-permission scopes, separate evaluation harnesses.
The bridge is buildable. We’ve seen teams ship it. But if you went into this week assuming “Claude does it all” was a viable bet through Q3, yesterday’s silence means that assumption needs reworking.
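For the shape of the bridge, a minimal sketch: the voice session treats Claude as its reasoning tool. The endpoint and headers follow Anthropic’s Messages API; the model name is a placeholder.

```typescript
// The voice session's tool handler delegates hard reasoning to Claude.
// Endpoint and headers follow Anthropic's Messages API; the model name is a
// placeholder to pin to whatever your text stack already runs.
async function reasonWithClaude(question: string, context: string): Promise<string> {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY!,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-your-production-model", // placeholder, not a real slug
      max_tokens: 1024,
      messages: [{ role: "user", content: `${context}\n\n${question}` }],
    }),
  });
  const data = await res.json();
  return data.content[0].text; // the voice model speaks this back to the caller
}
```

Each side keeps its own prompt cache, observability, and audit trail, which is where most of the three-to-five sprint-weeks actually goes.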
5. Are you actually waiting for Anthropic’s voice answer?
Code with Claude London is May 19. Tokyo is June 10. If voice is going to ship, those are the most likely venues — but May 19 is eleven days from now, and Anthropic’s track record this year on previewed-then-shipped features is mixed.
Holding eleven days for a maybe-launch is the riskier bet for most teams. You burn the production-voice first-mover window in your industry, and if Anthropic does ship in London, you can still migrate later — voice models are abstraction-friendly enough that switching costs are real but not prohibitive.
The honest call: ship on GPT-Realtime-2 today. If London delivers, evaluate then. If London doesn’t, you’re already in production while Anthropic-locked teams are still in planning.
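One way to keep that later migration cheap is to hide the vendor behind a thin interface from day one. The interface below is ours, not any SDK’s; the point is that a London launch becomes a new adapter, not a rewrite.

```typescript
// A thin vendor-neutral seam. Interface shape is ours, not any SDK's.
interface VoiceAgentSession {
  start(systemPrompt: string): Promise<void>;
  sendAudio(chunk: ArrayBuffer): void;
  onAudio(handler: (chunk: ArrayBuffer) => void): void;
  close(): Promise<void>;
}

// Today's adapter wraps the Realtime API; method bodies elided for brevity.
class OpenAIRealtimeSession implements VoiceAgentSession {
  async start(systemPrompt: string): Promise<void> { /* open socket, session.update */ }
  sendAudio(chunk: ArrayBuffer): void { /* input_audio_buffer.append */ }
  onAudio(handler: (chunk: ArrayBuffer) => void): void { /* subscribe to audio deltas */ }
  async close(): Promise<void> { /* drain and close the socket */ }
}

// If London delivers: class ClaudeVoiceSession implements VoiceAgentSession { ... }
// and the rest of your app never notices the swap.
```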
What this means for you
If you’re a solo developer or a 2-3 person team
Build on GPT-Realtime-2 with reasoning effort set to minimal. Use the WebRTC quickstart. Skip remote MCP for the first version; it’s a v2 cost optimization, not a v1 requirement. Target two weeks to first production deployment.
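For reference, the quickstart’s shape is roughly this browser-side sketch: mic in, agent audio out, one SDP exchange. The endpoint, model slug, and ephemeral-key flow follow the pattern of today’s Realtime API, so treat them as assumptions to verify against the GA docs.

```typescript
// Browser-side sketch: connect a mic to the voice agent over WebRTC.
async function connectVoiceAgent(ephemeralKey: string): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Play the agent's audio as it streams back.
  pc.ontrack = (e) => {
    const audio = new Audio();
    audio.srcObject = e.streams[0];
    audio.play();
  };

  // Send the microphone to the model.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  pc.addTrack(mic.getTracks()[0], mic);

  // Standard SDP offer/answer exchange with the Realtime endpoint (assumed URL).
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const res = await fetch("https://api.openai.com/v1/realtime?model=gpt-realtime-2", {
    method: "POST",
    headers: { Authorization: `Bearer ${ephemeralKey}`, "Content-Type": "application/sdp" },
    body: offer.sdp,
  });
  await pc.setRemoteDescription({ type: "answer", sdp: await res.text() });
  return pc;
}
```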
If you’re at a 10-50 person engineering team
Run the four-way head-to-head this week: GPT-Realtime-2, Cartesia, ElevenLabs, and your current Whisper + LLM hand-roll. Pick the call type that’s most painful (multilingual support if you have it; FAQ deflection if you don’t) and do a one-week pilot with 5% of traffic. The pricing math will decide for you.
If you’re at an enterprise with regulated voice workflows
Wait. GPT-Realtime-2 is genuinely production-ready for unregulated voice, but the audit-trail story for clinical, legal, or financial voice — where every word the model speaks needs to be replayable, attributable, and compliant — is still maturing. Pilot in non-regulated departments (HR triage, internal IT helpdesk, vendor-management support) and let the regulated workflows wait until the audit tooling catches up.
If you’re an engineering manager with a Claude-anchored stack
Your call this week: bridge cost vs. wait cost. The bridge is three to five sprint-weeks. The wait is at least eleven days for London with no guarantee. If your voice use case is high-leverage (top-3 cost line in support, or a revenue-generating outbound voice flow), bridge now. If it’s a Q4 nice-to-have, wait through London and decide on May 20.
If you’re a multilingual support team running the 4-vendor stack
You’re the team for whom yesterday changed the most. The integration tax on the Whisper + DeepL + Claude + ElevenLabs stack is the kind of thing your engineering org has been quietly carrying for 18 months. Translate is the consolidation play. Run the cost math against your actual call volume this week — it will not be close.
What this can’t fix
Five honest limits.
It still hallucinates. GPT-5-class reasoning in the voice loop doesn’t make the model factually grounded. It makes the model speak fluently while making things up. Every production voice agent needs retrieval grounded in your data and a fallback path when retrieval misses. Don’t ship a voice agent that has no fallback to a human, an email, or a web search.
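A minimal sketch of that guard, assuming a `retrieve()` function over your own index; the confidence threshold is illustrative and should come from your eval set, not from this post.

```typescript
// Either answer with grounding or hand off; never let the model improvise.
type Route =
  | { kind: "answer"; grounding: string }
  | { kind: "handoff"; reason: string };

async function groundOrHandoff(
  query: string,
  retrieve: (q: string) => Promise<{ text: string; score: number }[]>
): Promise<Route> {
  const hits = await retrieve(query);
  // Illustrative threshold: tune it on your own eval set before shipping.
  if (hits.length === 0 || hits[0].score < 0.75) {
    return { kind: "handoff", reason: "retrieval_miss" }; // human, email, or search
  }
  return {
    kind: "answer",
    grounding: hits.slice(0, 3).map((h) => h.text).join("\n"),
  };
}
```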
Latency under load isn’t yet stress-tested. Day-1 launches always show clean latency numbers; the real stress test happens when r/OpenAIDev users start posting throughput data over the next 14 days. If your agent is on an SLA, run a synthetic load test before you commit a customer to it.
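A bare-bones version of that test, assuming the WebSocket transport and that the first audio-typed server event marks time-to-first-audio; verify the event names against the GA schema, and mind the token bill at 50 concurrent sessions.

```typescript
import WebSocket from "ws";

// One synthetic session: connect, request a response, time the first audio event.
async function timeToFirstAudio(endpoint: string, key: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const t0 = Date.now();
    const ws = new WebSocket(endpoint, { headers: { Authorization: `Bearer ${key}` } });
    ws.on("open", () => ws.send(JSON.stringify({ type: "response.create" })));
    ws.on("message", (raw) => {
      const event = JSON.parse(raw.toString());
      if (typeof event.type === "string" && event.type.includes("audio")) {
        ws.close();
        resolve(Date.now() - t0); // ms from connect to first audio
      }
    });
    ws.on("error", reject);
  });
}

const ENDPOINT = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"; // assumed slug
const runs = await Promise.all(
  Array.from({ length: 50 }, () => timeToFirstAudio(ENDPOINT, process.env.OPENAI_API_KEY!))
);
runs.sort((a, b) => a - b);
console.log(`p50 ${runs[24]}ms  p95 ${runs[47]}ms over ${runs.length} sessions`);
```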
The 13-output-language list is short. Vietnamese, Thai, Indonesian, Tagalog, Hindi, and most African languages aren’t on it at launch. If your support footprint includes those, Translate doesn’t replace your stack yet.
SIP integration debugging will be painful. OpenAI’s SIP integration is brand new as of Day 1. The first-week reports on delivered-versus-promised throughput will tell you whether to wire your phone number to it now or wait two weeks for the stability patch wave.
The audit trail story is incomplete for regulated industries. Voice replay, attribution, and compliance evidence require tooling that doesn’t ship as part of the API. You’ll need to build (or buy) the recording layer, the tagging layer, and the redaction layer separately.
What the community is actually saying (May 7-8 sentiment)
The launch is the most viral OpenAI announcement of the past month on X. The official OpenAI launch post hit 11,300+ likes, 1,000+ reposts, 4,000+ bookmarks within the first 18 hours, and the GPT-5-class reasoning + 128K context jump (up from 32K) is what most developer threads are calling the bigger deal — not the headline pricing. Pricing reactions split: most developers see the per-token model as “expensive but acceptable” given the bundled reasoning + tool calling, but a meaningful minority is calling out a roughly 4-5× gap versus xAI’s voice equivalents and arguing that ~$0.24/min for a fully bundled session is “still too high for production all-rounder voice agents.” Specialized use cases (translation, customer-support deflection) are getting the most enthusiastic reception; cost-sensitive high-volume use cases are getting the most pushback.
The vendor landscape comparison is sharper than the launch post implies once you normalize the math. ElevenLabs’ API TTS sits in the $60-120 per million characters range, with the business “Conversational AI” tier quoted around 13,750 minutes on the standard plan (roughly a $0.08/min equivalent). Deepgram’s Aura-2 TTS runs around $0.03/min; its “Voice Agent API” product markets at $4.50 per hour with the LLM bundled in. Cartesia sits in the same premium tier, with recent benchmarks listing roughly $40 per 1M characters for its May 2025 model. Vapi and Retell sell full-stack agents (telephony + agent orchestration + LLM pass-through) and land mid-pack on per-minute price: convenient, but typically more expensive than direct GPT-Realtime-2 use once you factor in the LLM. The directional read: GPT-Realtime-2 lands on or below the equivalent table for most use cases when you treat it as the LLM and the voice layer combined, especially with prompt caching used aggressively. The exceptions are high-volume FAQ deflection, where ultra-low-latency specialists like Cartesia (~90ms) still win on latency-sensitive metrics, and raw cost-per-minute plays like Bland.
On the Anthropic-voice-silence question, there are quiet signals — but no hard confirmation — that voice work is in progress at Anthropic. Several analyst reads have pointed at Code with Claude London on May 19 as the most plausible venue if any voice product ships, while acknowledging the SF event’s deliberate “today is about how we are making our products work better for you” messaging suggests Anthropic is pacing rather than holding. The honest read for teams making a Q3 call this week: don’t bet on London delivering voice. If it does, you can migrate. If it doesn’t, you’ve already shipped.
The bottom line
Yesterday wasn’t just a voice-model launch. It was OpenAI claiming the production voice category while Anthropic decided to spend its biggest dev event of the year on text agents and infrastructure deals. That’s a strategic call by Anthropic — voice may not be where their next bet pays out — but for the team picking a stack this week, it makes the routing decision easier than it’s been in twelve months.
If voice is on your Q3 roadmap, the answer today is GPT-Realtime-2. Pilot small, watch the 14-day production reports, and re-evaluate after Code with Claude London on May 19. If voice isn’t on your Q3 roadmap, yesterday was still the moment the category quietly went from “wait and see” to “production-ready for most use cases” — which means it’s probably time to ask whether it should be.
Want a deeper walkthrough on building production voice agents from scratch? Our AI Voice Agents course covers the architecture patterns, latency budgeting, and tool-routing decisions step by step. If you’re trying to decide between Claude and ChatGPT for your broader stack, Claude vs ChatGPT has the full side-by-side. And for teams going wider into audio — TTS, voice cloning, audio analysis — AI Voice and Audio covers the toolbox.
Sources
- Introducing gpt-realtime and Realtime API updates for production voice agents — OpenAI
- Advancing voice intelligence with new models in the API — OpenAI
- OpenAI launches new voice intelligence features in its API — TechCrunch
- OpenAI has new voice models that reason, translate, and transcribe as you speak — 9to5Mac
- GPT-Realtime-2: A Voice Model with GPT-5-class Reasoning — DataCamp
- Code with Claude San Francisco — Anthropic
- Live blog: Code w/ Claude 2026 — Simon Willison
- Anthropic Release Notes — May 2026 — Releasebot
- OpenAI Realtime API: Production Voice Agents 2026 — Forasoft
- OpenAI unveils trio of realtime audio models — Neowin