Copilot Critique: Why Microsoft Lets Claude Fact-Check GPT

Copilot Researcher's Critique mode uses Claude to fact-check GPT's reports — 13.8% more accurate. Full breakdown of how it works and how to get access.

Microsoft just admitted that one AI isn’t enough.

Their new Copilot Researcher update lets GPT write a research report — then hands it to Claude for a fact-check before you ever see it. Two rival AI models, owned by competing companies, working together inside the same product. And the accuracy improvement isn’t small: 13.8% better than any standalone AI research tool on the market.

This is either the future of enterprise AI or the most expensive game of telephone ever invented. Let’s dig into how it actually works.

What Is Copilot Researcher?

Quick context if you’re not in the Microsoft ecosystem. Copilot Researcher is a built-in AI research tool inside Microsoft 365 Copilot — the $30/month add-on that brings AI into Word, Excel, PowerPoint, Outlook, and Teams. Researcher specifically handles deep research tasks: you ask it a complex question, and it searches the web, reads sources, and writes a multi-page report with citations.

Think of it as having a research assistant who reads 50 articles and writes you a summary. Before this update, that assistant was GPT working alone. Now there are two assistants — and they check each other’s work.

Critique Mode: GPT Writes, Claude Reviews

Here’s how Critique works in practice:

  1. You ask a research question. Something like “Compare the top 5 CRM platforms for mid-size manufacturing companies, including pricing, implementation timelines, and customer satisfaction data.”

  2. GPT does the research. It searches the web, reads sources, plans the report structure, and writes a full draft with citations.

  3. Claude reviews the draft. Before you see anything, Anthropic’s Claude model independently checks the report for factual accuracy, source reliability, citation quality, and completeness. It flags weak claims, shaky evidence, and gaps.

  4. The report gets refined. Based on Claude’s review, the report is strengthened before delivery.

The whole thing happens automatically. You don’t pick which model does what. You just get a better report.
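The four steps above amount to a sequential writer–reviewer pipeline. Here's a minimal sketch of that shape in Python — all function names and the plain-dict report format are hypothetical stand-ins, since Microsoft hasn't published the internal API; in the real product the writer and reviewer would be calls to GPT and Claude:

```python
# Hypothetical sketch of a writer-reviewer pipeline like Critique.
# The two "models" are stand-in functions, not real API calls.

def writer_model(question: str) -> dict:
    """Stand-in for GPT: research the question and draft a report."""
    return {
        "question": question,
        "claims": ["Platform A starts at $65/user/mo",
                   "Platform B deploys in 6 weeks"],
        "citations": ["https://example.com/pricing", None],  # one claim is uncited
    }

def reviewer_model(draft: dict) -> list[str]:
    """Stand-in for Claude: flag weak claims, missing citations, gaps."""
    issues = []
    for claim, cite in zip(draft["claims"], draft["citations"]):
        if cite is None:
            issues.append(f"Uncited claim: {claim!r}")
    return issues

def refine(draft: dict, issues: list[str]) -> dict:
    """Strengthen the draft based on the review (here: drop uncited claims)."""
    kept = [(c, s) for c, s in zip(draft["claims"], draft["citations"]) if s]
    draft["claims"] = [c for c, _ in kept]
    draft["citations"] = [s for _, s in kept]
    draft["review_notes"] = issues
    return draft

def critique_pipeline(question: str) -> dict:
    draft = writer_model(question)   # step 2: GPT writes
    issues = reviewer_model(draft)   # step 3: Claude reviews
    return refine(draft, issues)     # step 4: refine before delivery

report = critique_pipeline("Compare the top 5 CRM platforms")
print(report["review_notes"])
```

The key design point is that the reviewer never rewrites — it only produces a list of issues, and a separate refinement step decides what to change. That separation is what makes the review independent.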

And here’s the part that makes this interesting: Microsoft gave a competitor’s model — Claude, built by Anthropic — final editorial authority over whether its own $13 billion partner’s output is good enough to ship.

Council Mode: Two Reports, One Comparison

Council takes a different approach. Instead of one-writes-one-reviews, two AI models each produce a complete, independent report on the same question. Then a third model writes a “cover letter” comparing them — highlighting where they agree, where they disagree, and what unique insights each one brings.

You get both full reports plus the comparison. So if GPT emphasizes pricing data and Claude focuses on implementation risks, you see both perspectives and can decide which matters more for your situation.

It’s like getting a second opinion from a different doctor — except both doctors work in the same office and someone wrote you a summary of how their diagnoses differ.
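In code terms, Council swaps the sequential pipeline for a fan-out: two independent drafts, then a third pass that writes the comparison. The sketch below uses the same hypothetical stand-in style — the function names and the set-based "topics" comparison are illustrative, not Microsoft's implementation:

```python
# Hypothetical sketch of Council: two models answer independently,
# and a third summarizes where they agree and diverge.

def model_a(question: str) -> dict:
    """Stand-in for the first model's full report."""
    return {"author": "model-a", "topics": {"pricing", "features", "support"}}

def model_b(question: str) -> dict:
    """Stand-in for the second model's full report."""
    return {"author": "model-b", "topics": {"pricing", "implementation risk"}}

def cover_letter(a: dict, b: dict) -> dict:
    """Stand-in for the third model's comparison of the two reports."""
    return {
        "agree":  sorted(a["topics"] & b["topics"]),   # shared ground
        "only_a": sorted(a["topics"] - b["topics"]),   # unique to report A
        "only_b": sorted(b["topics"] - a["topics"]),   # unique to report B
    }

def council(question: str) -> dict:
    a, b = model_a(question), model_b(question)  # independent drafts
    return {"report_a": a, "report_b": b, "comparison": cover_letter(a, b)}

result = council("Compare the top 5 CRM platforms")
print(result["comparison"]["agree"])   # → ['pricing']
```

Unlike Critique, neither report is modified — the value is in the delta the cover letter surfaces.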

The Numbers: 13.8% Better Than Everything Else

Microsoft tested Critique against the DRACO benchmark — a standardized test for AI research quality that includes 100 complex research tasks across 10 different domains. DRACO was originally created by Perplexity, which makes this next part a bit ironic.

Copilot Researcher with Critique scored 57.4 on DRACO. The previous best? 50.4 — held by Perplexity’s own Deep Research tool. That’s a 13.8% improvement, or 7.0 points.

Here’s where it gets really telling:

| System | DRACO Score |
| --- | --- |
| Copilot Researcher + Critique | 57.4 |
| Perplexity Deep Research | 50.4 |
| Standalone Claude Opus 4.6 | 43.3 |
| Standalone GPT o3 | 42.7 |

Neither GPT nor Claude is impressive alone. GPT scores 42.7. Claude scores 43.3. But stack them together — one writing, one reviewing — and suddenly you’re at 57.4. The gap only opens when they work as a team.

The biggest improvements came in three areas:

  • Breadth and depth of analysis (+3.33 points)
  • Presentation quality (+3.04 points)
  • Factual accuracy (+2.58 points)

That last one matters most. Factual accuracy is exactly what enterprise customers lose sleep over — polished reports that look great but rest on weak evidence. Critique’s whole design targets that problem.

Why This Matters (Even If You Don’t Use Copilot)

This announcement tells us something bigger about where the entire AI industry is heading.

The single-model era is ending. For the past three years, the AI race has been “my model is smarter than your model.” GPT-5 vs Claude 4.6 vs Gemini 3. Each company trying to build the one model that wins everything. Microsoft just said, publicly, that the winning strategy isn’t one perfect model — it’s making different models work together.

The orchestration layer is becoming the product. Microsoft isn’t trying to build the best AI model anymore. They’re building the best system for combining everyone else’s models. That’s a huge strategic shift. As one tech analyst put it: Microsoft is positioning itself as “the interface layer for enterprise AI — monetizing subscriptions and inference demand without needing to win every frontier model race.”

This validates what power users already do manually. If you’ve ever pasted a ChatGPT response into Claude and asked “is this accurate?” — you’ve been doing Critique mode by hand. Microsoft just automated it and proved it works 13.8% better than either model alone.

What It Can’t Do

Let’s be honest about the limits.

It’s enterprise-only. You need a Microsoft 365 Copilot license ($30/user/month) plus enrollment in the Frontier early-access program. Your IT admin also has to enable third-party model access. This isn’t something you can try with a personal Microsoft account.

25 queries per month. That’s the current cap for both Critique and Council. For heavy researchers, that’s about one query per workday. Fine for important reports. Not enough if you want to use it for everything.

It’s still AI research, not human research. Critique reduces hallucinations and weak citations — it doesn’t eliminate them. Claude can catch obvious errors, but it can’t verify claims that require domain expertise or access to proprietary databases. The report still needs a human reviewer for anything high-stakes.

It’s slow compared to regular Copilot. Running two models in sequence takes longer than running one. Microsoft hasn’t published speed benchmarks, but expect Critique reports to take noticeably longer than standard Researcher output.

The “Perplexity already did this” critique is fair. Several observers pointed out that Council mode looks a lot like Perplexity’s multi-source approach — just scaled to enterprise. Microsoft beating Perplexity’s own benchmark using a similar method is a bit like copying someone’s homework and getting a better grade.

How to Access Critique and Council

Right now, both features are available through Microsoft’s Frontier program — their early-access initiative for enterprise customers driving large-scale Copilot adoption.

Requirements:

  1. Microsoft 365 Copilot license ($30/user/month)
  2. Frontier program enrollment (talk to your Microsoft account rep)
  3. IT admin must enable third-party model access (for Claude)
  4. Open Copilot Chat → Tools → Researcher → Select “Auto” from the model picker

If you can’t see the model picker options, check with your IT admin. The feature needs to be enabled at the tenant level.

Critique vs. Doing It Yourself

Here’s the question nobody on social media is asking: can you just do this manually?

| Approach | Cost | Effort | Quality |
| --- | --- | --- | --- |
| Copilot Critique | $30/mo + Frontier | Automatic | 57.4 DRACO |
| Manual: ChatGPT → paste into Claude | $20-40/mo (two subscriptions) | 10-15 min per report | Varies (you're the orchestrator) |
| Single AI tool | $20/mo | Quick | ~42-43 DRACO |

The manual approach works. But it’s tedious, you have to figure out what to ask Claude to check, and you won’t get the structured review that Critique provides. For one-off research? Two browser tabs is fine. For daily enterprise research across a team of 50 people? That’s where Critique earns its keep.
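If you do go the two-tab route, the biggest friction is knowing what to ask the second model to check. A reusable prompt template solves most of that — the checklist below is just one plausible set of review criteria, not anything Microsoft or Anthropic publishes:

```python
# A reusable review prompt for manual cross-checking: paste one model's
# report into another model along with this checklist.

REVIEW_CHECKLIST = [
    "Flag any factual claim that is not backed by a citation.",
    "Rate the reliability of each cited source (primary vs. secondary).",
    "List questions the report should answer but doesn't.",
    "Mark numbers or dates that are inconsistent with the rest of the report.",
]

def build_review_prompt(report_text: str) -> str:
    checks = "\n".join(f"{i}. {item}" for i, item in enumerate(REVIEW_CHECKLIST, 1))
    return (
        "Review the research report below. Do not rewrite it; "
        "only critique it against this checklist:\n"
        f"{checks}\n\n--- REPORT ---\n{report_text}"
    )

prompt = build_review_prompt("Platform A starts at $65/user/month ...")
print(prompt)
```

Asking the reviewer to critique rather than rewrite matters: a rewrite quietly blends the two models' errors, while a critique keeps the original report and its problems visible side by side.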

The Bottom Line

Microsoft just proved that two AI models arguing with each other produce better work than either one thinking alone. That’s not a small insight — it changes how we should think about AI accuracy.

For enterprise teams already paying for Copilot, getting into the Frontier program to try Critique is a no-brainer. The 13.8% accuracy improvement is real, and it specifically targets the thing that matters most: whether you can trust the report enough to act on it.

For everyone else, the takeaway is simpler but just as useful: if you’re making important decisions based on AI-generated research, get a second opinion. From a different AI. The models are surprisingly good at catching each other’s mistakes — and surprisingly bad at catching their own.

