Frontier Firms Use 3.5x More AI Per Worker: A 10-Minute Eng Manager Self-Assessment

OpenAI's B2B Signals report dropped a 3.5x/16x benchmark. Here's how to calculate where your team actually sits — without overinterpreting an OpenAI-only chart.

On May 6, 2026, OpenAI published the first edition of B2B Signals, a quarterly research report based on privacy-preserving, aggregated telemetry from enterprise OpenAI usage. The headline number that landed in every eng manager’s Slack within 24 hours: frontier firms — companies at the 95th percentile of usage — use 3.5× as much AI intelligence per worker as typical firms, up from 2× a year ago.

The deeper finding is sharper. Most of the gap isn’t volume — it’s depth. Message volume explains only 36% of the frontier advantage; the remaining 64% comes from richer, more complex AI use. And the single most extreme metric in the report: frontier firms send 16× as many Codex messages per worker as typical firms.

Those three numbers are going to show up in every Q3 board deck where a CTO has to defend their AI investment, and in every Q4 OKR-setting conversation where an eng manager has to argue for or against hiring. The number is doing real work in the world. So is the methodology, and that’s where it gets interesting.

This piece walks you through what OpenAI is actually measuring (tokens, not messages), how to calculate the version of this number for your own team, the four honest caveats that should appear on the board deck slide next to the 3.5×, and what to do this quarter regardless of where you land.

What “Frontier Firm” Actually Means in the Report

OpenAI’s definition of frontier firm is specific. It’s the 95th percentile of enterprise customers measured by tokens per worker, where tokens are OpenAI’s proxy for “intelligence demanded.” A LinkedIn explainer from OpenAI Chief Economist Ronnie Chatterji frames it as: “frontier firms… at the 95th percentile … are using 3.5 times more tokens per employee than the median.”

That phrasing matters because tokens are not the same as messages. The report breaks the 3.5× gap into two components:

  • 36% of the gap is explained by activity — how many messages per worker.
  • 64% of the gap is explained by depth — tokens per message: longer context, richer prompts, more substantive outputs, more agentic / Codex-driven workflows.

Translation for eng managers: frontier teams aren’t mainly sending more prompts. They’re sending prompts that each do far more work. A single Codex agentic workflow that spawns a multi-file refactor, runs tests, and opens a PR is one message, but it can be a 50,000-token message rather than a 200-token one. That’s most of the gap.
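To make the split concrete, here is a minimal sketch of one way to read it. It assumes tokens per worker equals messages per worker times tokens per message, and that the 36%/64% figures are shares of the log of the 3.5× gap. OpenAI has not published the decomposition method, so that reading is an assumption, and `split_gap` is a hypothetical helper name.

```python
import math


def split_gap(total_gap: float, activity_share: float) -> tuple[float, float]:
    """Split a multiplicative gap into activity and depth factors.

    Assumes the shares apply to the log of the gap (an assumption, not
    OpenAI's published method), so the two factors multiply back to the total.
    """
    log_gap = math.log(total_gap)
    activity_factor = math.exp(activity_share * log_gap)
    depth_factor = math.exp((1.0 - activity_share) * log_gap)
    return activity_factor, depth_factor


# 3.5x total gap, 36% attributed to activity (messages per worker).
activity, depth = split_gap(3.5, 0.36)
print(f"activity: {activity:.2f}x, depth: {depth:.2f}x")
# prints "activity: 1.57x, depth: 2.23x"
```

Under that reading, a frontier firm sends roughly 1.6 times as many messages per worker, and each message carries roughly 2.2 times as many tokens.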

The 16× Codex number is the load-bearing detail. Frontier firms are letting AI take multi-step action, not just respond to a one-line question. That’s where the depth advantage compounds.

The 5-Metric Self-Assessment

You can compute a directional version of “where does my team sit” with five numbers, most of which your Claude Cowork analytics, GitHub Copilot dashboard, or ChatGPT Enterprise admin reports already export. Plug each into the band below.

| Metric | What it measures | Where to get it | 50th-percentile typical | Frontier-shape (estimated) |
| --- | --- | --- | --- | --- |
| 1. Messages per worker per week | Activity floor | Vendor admin dashboard | 8–15 | 30+ |
| 2. Avg tokens per message | Depth proxy: are prompts substantive or one-line? | Sample your team’s prompts; estimate | 200–500 | 2,000–10,000 |
| 3. % of engineers using Codex / Claude Code / Cursor agentic mode weekly | Agentic adoption | IDE plugin usage report | 25–40% | 80%+ |
| 4. Agentic messages per engineer per week | The 16× signal | Codex usage report | 1–3 | 15–25 |
| 5. % of team that touched AI in the last 14 days | Coverage floor | Admin dashboard MAU | 60–75% | 95%+ |

These percentile bands are not OpenAI’s official numbers — OpenAI does not publish absolute thresholds, only the relative gap. The bands above are estimates the FindSkill team derived from triangulating the Resume Genius 2026 survey, the McKinsey 2025 State of AI, the Greenhouse 2026 AI in Hiring Report, and the GitHub Octoverse 2025 data. They will calibrate over time as more enterprises publish their own numbers.

Once you have the five numbers:

  • All five in the typical band: you look like roughly the median enterprise on OpenAI’s chart. There is no crisis. There is also no moat.
  • Three or four in the typical band, one or two below: the bottleneck is wherever the lowest number lives, usually metric 4 (agentic messages per engineer) or metric 2 (depth of prompt).
  • Three or more in or near the frontier band: you are quietly running at frontier shape on OpenAI’s chart. That number belongs in your Q3 board deck and is a defensible hiring-ROI argument.
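The banding can be sketched in a few lines of Python. The thresholds are the estimated bands from the table, not OpenAI thresholds. The metric keys, the `Band` class, and the `classify` and `assess` helpers are hypothetical names chosen for illustration, and the "between" label for values above typical but short of frontier is an added convenience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    typical_low: float
    typical_high: float
    frontier_floor: float


# Estimated bands from the self-assessment table (not official OpenAI numbers).
BANDS = {
    "messages_per_worker_per_week": Band(8, 15, 30),
    "avg_tokens_per_message": Band(200, 500, 2000),
    "pct_engineers_agentic_weekly": Band(25, 40, 80),
    "agentic_messages_per_engineer_per_week": Band(1, 3, 15),
    "pct_team_active_last_14_days": Band(60, 75, 95),
}


def classify(value: float, band: Band) -> str:
    """Place one metric value into below / typical / between / frontier."""
    if value >= band.frontier_floor:
        return "frontier"
    if value > band.typical_high:
        return "between"
    if value >= band.typical_low:
        return "typical"
    return "below"


def assess(metrics: dict[str, float]) -> dict[str, str]:
    """Label every metric in BANDS for one team."""
    return {name: classify(metrics[name], band) for name, band in BANDS.items()}


# Example team: typical on everything except agentic messages.
team = {
    "messages_per_worker_per_week": 12,
    "avg_tokens_per_message": 350,
    "pct_engineers_agentic_weekly": 30,
    "agentic_messages_per_engineer_per_week": 0.5,
    "pct_team_active_last_14_days": 70,
}
labels = assess(team)
frontier_count = sum(1 for label in labels.values() if label == "frontier")
print(labels, frontier_count)
```

For this example team, four metrics land in the typical band and metric 4 lands below it, which matches the second case above: the bottleneck is agentic usage.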

Why the 16× Codex Gap Is the Number That Matters

The 3.5× headline is the press story. The 16× developer-cohort gap is the eng-mgr-actionable story.

If you accept the 16× number, three implications for Q3 hiring and team composition follow. The most-shared framing of it is the Pro Logica observation on X: “Access to AI is table stakes. Depth of integration is the moat.”

Implication 1 — Hire for AI fluency, not AI tolerance. Your interview filters need to specifically test multi-turn agent-debugging skills: can the candidate iterate with Codex / Claude Code / Cursor across a multi-file refactor, defend why they accepted or rejected each suggestion, and explain what failed when the agent stalled? Resume keywords for “uses Copilot” are now a floor, not a ceiling. (And see the companion piece on redesigning behavioral interviews for AI-prepped candidates — the redesign work is calibrated to surface exactly this kind of fluency.)

Implication 2 — Don’t pay seat licenses for non-users. If 30% of your engineering team falls below the typical (50th-percentile) band for Codex usage, those seats are functionally idle. Consider tiered access (full agentic licenses for active users, read/explain access for the rest) and route the savings into training the low-usage group rather than into more seats. The GitHub Copilot June 1 usage-based billing change makes this calculation explicit; you’re being priced on usage now, not seats.
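A rough way to size the idle-seat share from a usage export is sketched below. It assumes you can get weekly agentic message counts per engineer, and it uses a floor of 1 message per week, the low end of the estimated typical band for metric 4. Both the floor and the `idle_seats` helper are illustrative assumptions, not a vendor API.

```python
def idle_seats(weekly_agentic_messages: dict[str, float], floor: float = 1.0) -> list[str]:
    """Return engineers whose weekly agentic message count is under the floor."""
    return sorted(
        name for name, count in weekly_agentic_messages.items() if count < floor
    )


# Hypothetical export: engineer -> average agentic messages per week.
usage = {"eng_a": 0, "eng_b": 4, "eng_c": 0.5, "eng_d": 18, "eng_e": 2}
idle = idle_seats(usage)
idle_share = len(idle) / len(usage)
print(idle, f"{idle_share:.0%}")
# prints "['eng_a', 'eng_c'] 40%"
```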

Implication 3 — Team density compounds. Putting all your AI-fluent engineers on one squad creates a frontier-velocity team while the other squads stay typical. Manage that intentionally. The fast-team / slow-team divergence is a real morale and politics problem — frontier teams will outproduce slow teams by 2–3×, frustrate them, and either spread the practice or burn out trying.

Four Caveats Your Board Deck Should Carry

The 3.5× is real, but it’s measured on OpenAI products only. Here are four honest caveats your board slide should include in 8-point font at the bottom.

Caveat 1 — It’s an OpenAI-only chart. OpenAI literally cannot see your team’s Claude usage, your team’s Gemini usage, your team’s Cursor usage, or your team’s Llama-on-Bedrock usage. The Ivris Tech analysis was direct: “OpenAI cannot measure what happens on Anthropic, Google, or Microsoft stacks. The 95th percentile cohort may be partially a selection artifact.” If your team is heavy on Claude Code or Cursor for the multi-step agentic work, you can be at the 95th percentile of AI usage broadly and look median or below on OpenAI’s chart.

Caveat 2 — Selection bias is plausible. Enterprises with heavy OpenAI usage are an opt-in cohort. Frontier-shape adoption on OpenAI ≠ frontier-shape adoption on AI broadly, because the frontier shape on Anthropic, Google, and Cursor has its own distribution that this report does not measure.

Caveat 3 — The methodology details are not fully published. The 36%-activity / 64%-depth decomposition is reported as a headline number; the econometric model (controls, sample composition, time-window) is not in the public version of the report. That doesn’t make the number wrong; it makes it less defensible than it looks in a single-slide format.

Caveat 4 — Vendor publishing its own benchmarks creates incentive misalignment. The AI Builders Digest take, “OpenAI is essentially publishing the sales deck for their new services company,” names the conflict honestly. The same week as B2B Signals, OpenAI announced the Deployment Company, whose pitch deck depends on enterprises feeling behind on AI adoption. The numbers aren’t fabricated, but they are selected.

None of this disqualifies B2B Signals as evidence. It does mean the board slide footnote should read: “OpenAI B2B Signals (May 2026); aggregated OpenAI usage only; full methodology not public.”

[Figure: the OpenAI B2B Signals report on openai.com, showing the frontier-firm advantage decomposition. Source: How Frontier Firms Are Pulling Ahead, OpenAI B2B Signals, May 6, 2026.]

What to Do This Quarter, Regardless of Where You Sit

The honest playbook splits into three tracks. Pick the one that matches your current numbers.

If you’re at or below the 50th percentile (typical): focus on metric 4 (agentic messages per engineer). Run a 4-week internal sprint where every engineer ships at least three Codex / Claude Code / Cursor agentic PRs. The point is not the PRs; it’s that the team builds the muscle memory for what an agentic workflow feels like. Once that’s in place, metrics 2 and 3 move on their own.

If you’re in the middle (60th–80th percentile): focus on depth, not volume. Audit a representative week of your team’s prompts. If the median prompt is under 300 tokens and the median session is under 5 messages, you have a depth problem. Ship internal documentation on the depth/agentic patterns — multi-file refactors, full-test-and-PR loops, codebase Q&A with context — and benchmark before/after.
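The depth check can be run on a sampled week of prompts with a short script. This sketch assumes each session is available as a list of per-prompt token counts. The 300-token and 5-message thresholds come from the paragraph above, while the data shape and the `has_depth_problem` helper are hypothetical.

```python
from statistics import median


def has_depth_problem(
    sessions: list[list[int]],
    min_median_tokens: int = 300,
    min_median_messages: int = 5,
) -> bool:
    """Flag a depth problem when both medians fall under their thresholds.

    Each session is a list of prompt token counts. Empty sessions are ignored.
    Raises statistics.StatisticsError if no prompts are supplied.
    """
    non_empty = [s for s in sessions if s]
    prompt_tokens = [tokens for s in non_empty for tokens in s]
    median_tokens = median(prompt_tokens)
    median_messages = median(len(s) for s in non_empty)
    return median_tokens < min_median_tokens and median_messages < min_median_messages


# Short, shallow sessions: median prompt is 110 tokens, median session is 3 messages.
shallow_week = [[120, 80, 200], [150, 90], [400, 60, 110, 95]]
# Long, substantive sessions: both medians are above the thresholds.
deep_week = [[2500, 1800, 3000, 2200, 2600, 1900], [4000, 3500, 2800, 3100, 2900]]

print(has_depth_problem(shallow_week))  # True
print(has_depth_problem(deep_week))     # False
```

Running the same function on a sample taken before and after the documentation push gives a simple before/after benchmark.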

If you’re at or near frontier shape (90th+ percentile): stop chasing the metric and start measuring outcomes. The whole point of the depth-not-volume framing is that activity correlates with effort but not with results. Switch your weekly engineering metric from “AI messages per worker” to “AI-assisted shipped artifacts that survived 30 days in production.” That’s the metric the next quarter’s B2B Signals report will probably start measuring.
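One possible way to compute that outcome metric is sketched below. It assumes "survived" means the artifact was not reverted within 30 days of shipping, and that you can tag artifacts as AI-assisted. Both are assumptions, and the `Artifact` record and `survived_count` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Artifact:
    shipped_on: date
    ai_assisted: bool
    reverted_on: Optional[date] = None  # None means never reverted


def survived_count(artifacts: list[Artifact], as_of: date, window_days: int = 30) -> int:
    """Count AI-assisted artifacts that stayed in production for the full window."""
    window = timedelta(days=window_days)
    count = 0
    for artifact in artifacts:
        if not artifact.ai_assisted:
            continue
        if as_of - artifact.shipped_on < window:
            continue  # too recent to judge yet
        reverted = artifact.reverted_on
        if reverted is not None and reverted - artifact.shipped_on < window:
            continue  # reverted inside the window, so it did not survive
        count += 1
    return count


artifacts = [
    Artifact(date(2026, 6, 1), ai_assisted=True),                  # survived
    Artifact(date(2026, 6, 10), True, reverted_on=date(2026, 6, 20)),  # reverted early
    Artifact(date(2026, 7, 20), ai_assisted=True),                 # too recent
    Artifact(date(2026, 6, 1), ai_assisted=False),                 # not AI-assisted
]
survivors = survived_count(artifacts, as_of=date(2026, 8, 1))
print(survivors)  # 1
```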

The Bottom Line

The 3.5× headline is real, useful, and partial. The 16× Codex finding is the actionable one — agentic depth is where the frontier advantage compounds, and the cheapest way to close the gap is to ship the muscle memory across your engineering team rather than to buy more seats.

Use B2B Signals as evidence, not as a verdict. The board slide footnote — OpenAI-only data, public methodology partial, vendor-published — is non-optional. The eng-manager moves below the slide are where the work actually happens.

If you want a structured walk through the Q3 rollout work — the agentic-adoption sprint, the depth-vs-activity audit, the seat-license rationalization, and the AI-assisted-shipped-artifact metric — the Enterprise AI Rollout Playbook course on FindSkill walks through it. The first two lessons are free.
