The 5 Domains on the Claude Certified Architect Exam

What each domain on the Claude Certified Architect (CCA-F) exam actually tests, with weights, real examples, and which to study first.

If you only skim the Anthropic exam guide for the Claude Certified Architect, it’s easy to treat the five domains as roughly interchangeable: five neat boxes, each with a percentage. They’re not. Two domains together make up nearly half the exam. One of them is the single biggest reason candidates with strong prompt-engineering chops still fail.

This post walks through each of the five CCA-F domains in order of how much of the exam they actually take up — what they test, where they show up in real production work, and which one you should study first based on your background.

The Five Domains, by Weight

Here’s the actual exam blueprint. Memorize the weights — they should drive your study time allocation more than anything else.

# | Domain | Weight | What it really tests
1 | Agentic Architecture & Orchestration | 27% | Designing systems where the model decides what to do
2 | Claude Code Configuration & Workflows | 20% | The CLAUDE.md / hooks / plan-mode / MCP wiring layer
3 | Prompt Engineering & Structured Output | 20% | Writing prompts and tool descriptions that don’t drift
4 | Tool Design & MCP Integration | 18% | How tools should expose themselves to a model
5 | Context Management & Reliability | 15% | What survives when conversations get long

Notice that Domains 1 and 2 together account for 47% of the test; Domains 3, 4, and 5 make up the other 53%. A candidate who is a brilliant prompt engineer but has never built an agentic system or wired up Claude Code at scale is realistically competing for only about half the available points. That’s why the most-shared piece of advice from passers is some variation of “this is not a prompting fluency test.”

Let’s go through each domain, in weight order.

Domain 1 — Agentic Architecture & Orchestration (27%)

This is the heaviest-weighted domain on the exam, and it’s where the most candidates lose points. The core question this domain answers: “Given a problem, what’s the right shape of agentic system to solve it, and how do its parts coordinate?”

What it tests:

  • When to use a single Claude call vs. when to spin up subagents
  • Subagent isolation — they don’t share context with the parent agent unless you explicitly pass it
  • Coordinator vs. worker patterns
  • When to escalate to a human (and how the system knows it should)
  • Multi-step plans, plan revision, and recovery from a step that fails
  • The Agent SDK vs. Claude Code as platform choices

Where it shows up in real work:

Any time you build a system that does more than one thing in sequence, you’re in this domain. Examples: a research bot that searches, reads, summarizes, and drafts — those are four logical steps that may or may not warrant subagents. A code-review agent that runs tests, parses failures, and proposes fixes — same idea. A customer support resolution agent that gathers context, decides whether to refund or escalate, and writes the reply.

The exam is full of scenarios like “Your agent occasionally hallucinates customer order numbers. Which architectural change is most likely to reduce this?” Wrong answers focus on prompt tweaks. Right answers usually involve a structural change — a tool call that retrieves real order data, a hard gate that rejects responses without a verified order ID, or a subagent whose only job is verification.
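To make the "hard gate" idea concrete, here is a minimal sketch of a check that runs on the agent's draft reply before it reaches the customer. The order-ID format (`ORD-` plus six digits) and the function names are assumptions made for illustration; the set of verified IDs is assumed to come from whatever tool call retrieved real order data.

```python
import re

# Hypothetical order-ID format for illustration: "ORD-" followed by 6 digits.
ORDER_ID_PATTERN = re.compile(r"\bORD-\d{6}\b")


def unverified_order_ids(draft_reply: str, verified_ids: set[str]) -> list[str]:
    """Return order IDs mentioned in the draft that no tool call confirmed."""
    mentioned = ORDER_ID_PATTERN.findall(draft_reply)
    # dict.fromkeys drops duplicates while preserving first-seen order.
    unique = dict.fromkeys(mentioned)
    return [oid for oid in unique if oid not in verified_ids]


def gate_reply(draft_reply: str, verified_ids: set[str]) -> tuple[bool, str]:
    """Hard gate: pass the draft only if every order ID in it is verified."""
    bad = unverified_order_ids(draft_reply, verified_ids)
    if bad:
        return False, "Rejected: unverified order IDs " + ", ".join(bad)
    return True, draft_reply
```

The point of the design is that the check lives in code, so a hallucinated ID cannot reach the user no matter how the prompt is worded.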

Concrete example question pattern:

You have a multi-agent research system where a coordinator dispatches tasks to subagents. One subagent finds and summarizes academic papers, another scans news, and a third synthesizes. The synthesizer keeps making claims that don’t appear in either subagent’s output. What’s most likely wrong?

The wrong-tempting answer is “better prompt for the synthesizer.” The right answer is structural: subagents don’t share context with each other, and the synthesizer is probably hallucinating because it isn’t receiving the subagents’ outputs in a structured form it can ground answers in. The fix is at the orchestration layer, not the prompt layer.
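One way the orchestration-layer fix could look is sketched below: the coordinator collects each subagent's findings into a structured payload and passes that payload explicitly to the synthesizer. The `Finding` fields, the `<findings>` wrapper, and the function name are illustrative assumptions, not part of any Anthropic API.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Finding:
    source_agent: str  # which subagent produced this finding
    claim: str         # one claim, stated plainly
    citation: str      # where the subagent found it


def build_synthesizer_input(findings: list[Finding]) -> str:
    """Serialize every subagent finding into the synthesizer's prompt.

    Subagents share nothing implicitly, so anything the synthesizer should
    ground its answer in has to be passed here by the coordinator.
    """
    payload = [dict(id=f"F{i}", **asdict(f)) for i, f in enumerate(findings, 1)]
    return (
        "Synthesize ONLY from the findings below. "
        "Cite finding ids for every claim.\n"
        "<findings>\n" + json.dumps(payload, indent=2) + "\n</findings>"
    )
```

Giving each finding an id also makes the output checkable: a downstream step can reject any synthesized claim that cites no id or an id that does not exist.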

Why people lose points here: Treating every problem as a prompt problem. The exam keeps testing whether you can recognize when the right move is architectural — adding a tool, adding a gate, adding a subagent — rather than tweaking words.

Domain 2 — Claude Code Configuration & Workflows (20%)

This domain is the operational layer of building with Claude. It’s heavily weighted because Anthropic has made Claude Code the canonical environment for agentic development, and the exam reflects that.

What it tests:

  • The three-level CLAUDE.md hierarchy (project, user, enterprise) and how they merge
  • Plan mode mechanics — when it activates, what it can and can’t do
  • Hooks (pre-tool-use, post-tool-use, user-prompt-submit) and what they’re for
  • Permissions and the allow/deny logic for tools and commands
  • Slash commands and how to expose internal workflows as commands
  • MCP server registration and discovery
  • Sub-agent invocation from inside Claude Code
  • Session management — what persists, what doesn’t, what resumes

Where it shows up in real work:

Any time you’re configuring Claude Code for a team. Setting up CLAUDE.md so juniors and seniors get the right level of guidance. Wiring an MCP server that wraps your internal database. Writing a hook that blocks accidental commits to main. Building a slash command that runs your team’s deploy workflow.
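As a rough sketch of the "block accidental commits to main" hook, the decision logic might look like the function below. It assumes the pre-tool-use event arrives as a dict with `tool_name` and `tool_input.command` keys; that shape, the regex, and the name `should_block` are assumptions to verify against the hooks documentation for your Claude Code version.

```python
import re

# Simplified heuristic: matches "git commit" or "git push" inside a command.
RISKY_GIT = re.compile(r"\bgit\s+(commit|push)\b")


def should_block(event: dict, current_branch: str) -> bool:
    """Decide whether a pending tool call should be blocked.

    `event` is assumed to look like
    {"tool_name": "Bash", "tool_input": {"command": "..."}}.
    """
    if event.get("tool_name") != "Bash":
        return False  # only shell commands can commit or push
    command = event.get("tool_input", {}).get("command", "")
    return current_branch == "main" and bool(RISKY_GIT.search(command))
```

A real hook script would read the event from stdin, look up the current branch (for example with `git rev-parse --abbrev-ref HEAD`), and exit with the blocking status when `should_block` returns True. Keeping the decision in a pure function makes it easy to unit test separately from the wiring.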

Concrete example question pattern:

A developer’s user-level CLAUDE.md says “always run tests before committing.” The project-level CLAUDE.md (in the repo) says “skip tests on doc-only changes.” The team’s enterprise CLAUDE.md says “all commits must be GPG signed.” A developer makes a doc-only change and tries to commit. What does Claude Code do?

The wrong answer is “refuses to commit because the user-level rule blocks it.” The right answer requires you to know the precedence: project rules can override user rules within their scope, and enterprise rules supersede both for security-relevant directives. So Claude Code skips tests (project override), enforces GPG signing (enterprise rule), and lets the commit through.
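The precedence described in that answer can be pictured with a toy model. This is only a mental aid under a strong assumption: real CLAUDE.md files are free-form text rather than key/value rules, and the code below does not reflect how Claude Code is implemented. The layer names and rule keys are made up for the example.

```python
# Toy model only: real CLAUDE.md files are free-form text, not key/value rules.
LAYER_ORDER = ["user", "project", "enterprise"]  # later layers win on conflict


def resolve_rules(layers: dict[str, dict[str, str]]) -> dict[str, tuple[str, str]]:
    """Merge per-layer rules; return {rule_key: (value, winning_layer)}."""
    resolved: dict[str, tuple[str, str]] = {}
    for layer in LAYER_ORDER:
        for key, value in layers.get(layer, {}).items():
            resolved[key] = (value, layer)  # overwrite lower-precedence value
    return resolved


rules = resolve_rules({
    "user": {"tests": "always run before commit"},
    "project": {"tests": "skip on doc-only changes"},
    "enterprise": {"signing": "GPG required"},
})
# rules["tests"]   -> ("skip on doc-only changes", "project")
# rules["signing"] -> ("GPG required", "enterprise")
```

The useful habit for conflict scenarios is the same as in the toy: list each layer's rules, then resolve one rule at a time instead of treating the three files as a single blob.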

Why people lose points here: Memorizing CLAUDE.md as one thing instead of three layers. Several passers reported they went into the exam thinking the precedence was simpler than it is. The exam loves to construct conflict scenarios.

Domain 3 — Prompt Engineering & Structured Output (20%)

This is the domain most people think they know. It’s also the domain where the gap between “good at prompting” and “passes CCA-F questions” is widest.

What it tests:

  • XML tag usage and when it improves vs. hurts performance
  • System prompt construction for production use cases
  • Tool descriptions that don’t leak implementation details
  • JSON schema design for structured output
  • Output validation and how to handle schema violations
  • Few-shot examples and when they hurt instead of help
  • Chain-of-thought patterns and when not to use them
  • Anti-patterns — like asking the model to “be careful” or “don’t hallucinate”

Where it shows up in real work:

Every time you write a system prompt, define a tool, or specify a JSON output schema. The questions in this domain are heavily focused on tool descriptions and structured output, because those are where most production reliability problems live. Bad tool description = model calls the tool at the wrong time. Loose JSON schema = model returns malformed output that breaks downstream parsing.
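To illustrate the "loose schema breaks downstream parsing" problem, here is a small standard-library sketch that validates a model's JSON reply strictly before anything downstream touches it. The field names, the allowed actions, and the function name are hypothetical; a production system might use a full JSON Schema validator instead.

```python
import json
from typing import Optional

# Hypothetical schema for illustration: field name -> required Python type.
REQUIRED_FIELDS = {"order_id": str, "action": str, "amount_cents": int}
ALLOWED_ACTIONS = {"refund", "escalate", "none"}


def parse_model_output(raw: str) -> tuple[Optional[dict], list[str]]:
    """Parse and strictly validate a JSON reply; return (data, errors)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]

    errors: list[str] = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        # bool is a subclass of int, so reject it explicitly.
        elif not isinstance(data[field], expected) or isinstance(data[field], bool):
            errors.append(f"wrong type for {field}")
    extra = set(data) - set(REQUIRED_FIELDS)
    if extra:
        errors.append("unexpected fields: " + ", ".join(sorted(extra)))
    action = data.get("action")
    if isinstance(action, str) and action not in ALLOWED_ACTIONS:
        errors.append("action must be one of: " + ", ".join(sorted(ALLOWED_ACTIONS)))
    return (data if not errors else None), errors
```

When validation fails, the error list is specific enough to send back to the model for a retry, which tends to be safer than trying to repair malformed output in place.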

Concrete example question pattern:

You’re defining a tool called get_customer_orders. Which of these descriptions would most reliably guide the model to call this tool only when appropriate?

  • A) “Retrieves customer order data from our Postgres database. Uses the customer_id field as a key. Returns a JSON list of orders.”
  • B) “Use this tool when the user asks about their past orders or order history. Requires the customer’s verified ID. Returns a list of orders with timestamps and totals.”
  • C) “Customer orders tool.”
  • D) “Get orders for any user.”

The wrong-tempting answer is A — it’s the most detailed. The right answer is B. The model doesn’t care about Postgres or the customer_id field; that’s implementation. The model needs to know when to call the tool (“user asks about their past orders”) and what it gets back in user-facing terms. Putting implementation in the description is a real-world reliability bug — it gives the model the wrong mental model of when this tool is the right call.
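For reference, option B could be written as a tool definition along these lines. The `name` / `description` / `input_schema` layout follows the shape used for tool definitions in the Anthropic Messages API; the parameter name and its description are assumptions added for the example.

```python
# Tool definition sketch. The description says WHEN to call the tool and WHAT
# comes back; it says nothing about the database behind it.
get_customer_orders_tool = {
    "name": "get_customer_orders",
    "description": (
        "Use this tool when the user asks about their past orders or order "
        "history. Requires the customer's verified ID. Returns a list of "
        "orders with timestamps and totals."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "description": "The customer's verified ID.",
            }
        },
        "required": ["customer_id"],
        "additionalProperties": False,
    },
}
```

Note that the parameter name is part of the tool's interface, which the model does need; the storage engine and key structure are implementation details, which it does not.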

Why people lose points here: They optimize prompts the way they optimize copy — for clarity to humans. The exam tests whether you optimize for clarity to a model trying to make a routing decision. Different audience, different rules.

Domain 4 — Tool Design & MCP Integration (18%)

Tool design is its own domain because it’s where most production agentic systems break down. The questions here are technical and concrete.

What it tests:

  • When to expose something as a tool vs. handle it in code
  • Tool granularity — one tool that does many things vs. many tools that each do one
  • Idempotency and when it matters
  • Error contract design — what an error message should look like to a model
  • The MCP spec itself — server vs. client, transport, capability negotiation
  • Resource and prompt exposure via MCP, not just tools
  • Authentication and authorization for tool calls
  • Tool composability — when one tool’s output is another tool’s input

Where it shows up in real work:

Every internal tool you wrap, every MCP server you build, every API you expose to an agent. If your team has built any kind of agentic infrastructure, you’ve solved or hit problems in every bullet above.
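The "error contract" bullet is easier to reason about with an example. The sketch below shows one possible shape for an error a tool returns to a model: a stable code, a plain-language message, a retry flag, and a suggested next step. All field and function names here are hypothetical choices, not something mandated by MCP.

```python
from dataclasses import dataclass, asdict


@dataclass
class ToolError:
    code: str        # stable, machine-readable identifier
    message: str     # what went wrong, in plain language
    retryable: bool  # whether calling again unchanged could succeed
    suggestion: str  # what the model should try next


def error_result(err: ToolError) -> dict:
    """Wrap an error so the model can tell failure apart from data."""
    return {"is_error": True, "error": asdict(err)}


def lookup_order(order_id: str, orders: dict[str, dict]) -> dict:
    """Hypothetical tool handler returning data or a structured error."""
    if order_id not in orders:
        return error_result(ToolError(
            code="ORDER_NOT_FOUND",
            message=f"No order with id {order_id!r}.",
            retryable=False,
            suggestion="Ask the user to confirm the order id.",
        ))
    return {"is_error": False, "order": orders[order_id]}
```

Compared with a raw stack trace or an empty result, a contract like this tells the model whether to retry, change its arguments, or go back to the user.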

Concrete example question pattern:

Your MCP tool delete_customer_account is being called sporadically when the user is just asking about deletion, not asking to delete. Which mitigation is the strongest?

  • A) Improve the tool description.
  • B) Add a confirmation prompt to the tool.
  • C) Split into two tools: get_deletion_info (read-only) and delete_customer_account (destructive, requires explicit confirmation token).
  • D) Add a hook that warns the user before calling.

The right answer is C, but B is also defensible. The wrong-tempting answer is A — and a poorly written description IS part of the problem, but it’s not the strongest mitigation. The exam pushes you toward structural fixes for safety-critical operations: use the type system (separate tools) rather than relying on the model’s inference. “Programmatic enforcement vs. prompt-based guidance” — Anthropic’s exam guide flags this as the single most-tested concept across the whole exam, and this domain is where it surfaces most clearly.
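A minimal sketch of option C follows, assuming the application (not the model) issues a single-use confirmation token after the user explicitly confirms. The in-memory token store and the `issue_confirmation_token` helper are hypothetical; a real system would persist and expire tokens.

```python
import secrets

# In-memory store for the sketch: token -> customer_id.
_pending_tokens: dict[str, str] = {}


def get_deletion_info(customer_id: str) -> dict:
    """Read-only tool: explains deletion, changes nothing."""
    return {"customer_id": customer_id,
            "effects": "Deletes profile and order history. Cannot be undone."}


def issue_confirmation_token(customer_id: str) -> str:
    """Called by application code after the user explicitly confirms.

    Not exposed as a tool, so the model cannot mint its own token.
    """
    token = secrets.token_urlsafe(16)
    _pending_tokens[token] = customer_id
    return token


def delete_customer_account(customer_id: str, confirmation_token: str) -> dict:
    """Destructive tool: refuses unless given a matching single-use token."""
    # pop() consumes the token, so it cannot be replayed.
    if _pending_tokens.pop(confirmation_token, None) != customer_id:
        return {"deleted": False, "error": "missing or invalid confirmation token"}
    return {"deleted": True, "customer_id": customer_id}
```

With this split, a question about deletion can only ever reach the read-only tool, and a mistaken call to the destructive tool fails closed because the model has no valid token to supply.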

Why people lose points here: Treating MCP as a transport detail. The exam tests it as a design philosophy — the question of what shape your tools should have is at least as important as how to wire them up.

Domain 5 — Context Management & Reliability (15%)

The smallest domain by weight, but the one that produced the most “I’d never even heard of this” comments from people who failed.

What it tests:

  • Effective context window vs. raw context window
  • Rolling window context — what it is, when it activates, what survives
  • Summarization-on-overflow behavior
  • Context pruning strategies you can implement
  • Cost vs. reliability tradeoffs (longer context = more cost, more drift)
  • Caching and prompt caching for cost control
  • Recovery patterns — what to do when a tool call fails repeatedly
  • Retries, timeouts, and circuit breakers in agentic systems
  • Detecting when an agent is stuck in a loop and breaking out

Where it shows up in real work:

Any agent that runs for more than a handful of turns. By turn 10 of a serious workflow, you’re in this domain whether you knew it or not. Production reliability — getting an agent to behave well on its 100th run, not just its first — is mostly this domain.
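Two of the reliability patterns from the list above, circuit breakers and loop detection, fit in a few lines. This is a simplified sketch: the thresholds, the class and function names, and the idea of representing a tool call as a `(tool, args)` string pair are all assumptions chosen for the example.

```python
from collections import Counter


class CircuitBreaker:
    """Stop calling a tool after too many consecutive failures."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        # "Open" means the breaker has tripped and calls should stop.
        return self.consecutive_failures >= self.max_failures

    def record(self, success: bool) -> None:
        # Any success resets the count; failures accumulate.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


def is_looping(calls: list[tuple[str, str]], window: int = 6, limit: int = 3) -> bool:
    """Flag a loop when one (tool, args) pair appears at least `limit` times
    within the last `window` calls."""
    recent = Counter(calls[-window:])
    return any(count >= limit for count in recent.values())
```

In an agent loop, the orchestrator would check `is_open` before each tool call and `is_looping` after each one, then escalate or change strategy instead of letting the agent keep retrying.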

Concrete example question pattern:

Your customer support agent occasionally produces wildly off-topic responses, but only after long sessions (30+ turns). It’s running on a model with a 200K context window. The conversation history is well under that. What’s most likely happening?

  • A) The model is overheating from too many tokens.
  • B) The effective context window is shorter than the raw window — at some point earlier turns are summarized or dropped, and the agent loses important grounding.
  • C) Tool descriptions are conflicting.
  • D) The system prompt is too long.

The right answer is B. Effective context — the part the model actually attends to with full fidelity — is shorter than the published window for most production models. In long sessions, older turns get summarized or evicted. The fix isn’t a larger model; it’s structural: pin critical state in the system prompt, persist it via a tool, or summarize aggressively yourself rather than letting the model do it implicitly.
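The "pin critical state and summarize yourself" fix can be sketched as a function that rebuilds the request on every turn. It assumes the caller supplies a `summarize` function (for instance, a separate model call) and that the request carries the system prompt separately from the message list; the names and the `keep_last` default are illustrative.

```python
from typing import Callable


def build_request(system_prompt: str,
                  pinned_state: dict[str, str],
                  turns: list[dict],
                  summarize: Callable[[list[dict]], str],
                  keep_last: int = 8) -> dict:
    """Assemble a request: pinned facts + summary of old turns + recent turns."""
    cut = max(0, len(turns) - keep_last)
    older, recent = turns[:cut], turns[cut:]

    parts = [system_prompt]
    if pinned_state:
        # Critical facts live in the system prompt so they never scroll away.
        pinned = "\n".join(f"- {k}: {v}" for k, v in pinned_state.items())
        parts.append("Pinned state (authoritative):\n" + pinned)
    if older:
        # Summarize explicitly instead of relying on implicit eviction.
        parts.append("Summary of earlier turns:\n" + summarize(older))
    return {"system": "\n\n".join(parts), "messages": recent}
```

Because the pinned state is rebuilt from a source of truth on each turn, a 30-turn session carries the same grounding as a 3-turn one, regardless of what happens to older messages.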

Why people lose points here: Underestimating it. 15% is the smallest weight, so people skim it during prep. But the questions are conceptually unfamiliar — “rolling window context” wasn’t standard vocabulary before Anthropic published their docs — and skimming this domain means walking in without the words for what’s being asked.

Which Domain Should You Study First?

This depends on your background. Here’s a fast-decision table.

Your strongest existing skill | Where you’ll lose points | Study first
Prompt engineering | Agentic architecture | Domain 1 (27%)
Backend / systems | Prompt engineering | Domain 3 (20%)
Daily Claude Code user | Tool design / MCP | Domain 4 (18%)
Built MCP servers | CLAUDE.md hierarchy / hooks | Domain 2 (20%)
Generalist | Context management | Domain 5 (15%)

The reasoning: study what you’re weakest at first, because that is where each study hour buys the most points; your strongest area is already close to its ceiling. A great prompt engineer should still spend the bulk of their prep on agentic architecture, because that’s where the gap between “what they know” and “what gets tested” is widest.

What This Means for You

If you’re studying for the exam: Stop treating the five domains as equal weight. Use the percentages as the baseline for your study schedule, then tilt toward your weak spots. Spend at least a third of your prep on agentic architecture; it’s 27% of your score and it’s where most people lose. Spend less than a sixth on context management, but don’t skip it; the unfamiliar vocabulary is what catches people off-guard.

If you’re not taking the exam but you build with Claude: This breakdown is also a competency map for general AI engineering. The five domains are good real-world buckets. If you’ve never thought about MCP error contracts or rolling-window context behavior, those are useful holes to fill regardless of whether you ever get certified.

If you’re a hiring manager: When you’re vetting AI engineering candidates, the same five domains form a clean interview rubric. Asking about agentic architecture and context management together gives you signal on the harder skills; prompt engineering and tool descriptions test the surface fluency.

The bottom line: The CCA-F isn’t five equally important areas — it’s a 27/20/20/18/15 split, and you should study in those proportions. The most common failure mode isn’t being weak across the board. It’s being strong on prompt engineering (a 20% slice) and assuming that translates to passing the whole exam.
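As a quick worked example of studying "in those proportions," the sketch below splits a study budget by the blueprint weights. The function name and the 40-hour figure in the usage note are assumptions for illustration.

```python
# Domain weights from the exam blueprint, in percent.
WEIGHTS = {
    "Agentic Architecture & Orchestration": 27,
    "Claude Code Configuration & Workflows": 20,
    "Prompt Engineering & Structured Output": 20,
    "Tool Design & MCP Integration": 18,
    "Context Management & Reliability": 15,
}


def allocate_hours(total_hours: float) -> dict[str, float]:
    """Split study hours in proportion to the exam weights."""
    total_weight = sum(WEIGHTS.values())  # 100
    return {domain: round(total_hours * weight / total_weight, 1)
            for domain, weight in WEIGHTS.items()}
```

With a 40-hour budget this gives 10.8 hours for Domain 1, 8.0 each for Domains 2 and 3, 7.2 for Domain 4, and 6.0 for Domain 5. Treat that as a baseline and shift hours toward whichever domain is your weakest.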

If you’d rather follow a prep path that drills scenario questions for each of these domains in proportion to their weight, our Claude Certified Architect Exam Prep degree walks through each domain with worked examples, common traps, and the kind of “the system is broken, what’s wrong” exercises the real exam tests. You can also get there with the free Anthropic Academy tracks plus disciplined practice — the cert is real and earnable through public resources.

Frequently Asked Questions

Are all five domains weighted equally? No. Agentic Architecture is 27%, Claude Code Config and Prompt Engineering are 20% each, Tool Design is 18%, and Context Management is 15%. The first two domains together are 47% of the exam.

Which domain is the hardest? Agentic Architecture. Multiple passers said the same thing — questions in this domain require systems-thinking, not recall. It’s also the heaviest-weighted, which is why it’s the highest-leverage area to study.

Which domain is most often underestimated? Context Management. It’s the smallest weight, but the vocabulary is the least familiar — concepts like “rolling window context” and “effective context vs. raw context” trip people up. Don’t skim it just because it’s only 15%.

How are the domains scored — flat percentage or scaled? The overall exam is scaled (720 of 1000 to pass), and the domain weights are reflected in question count and difficulty mix. You see your domain-level performance after the exam.

Can you fail one domain and still pass overall? Yes. The pass threshold is a total score, not a per-domain threshold. A genuinely strong showing on the heavy domains (Architecture + Claude Code = 47%) can offset a weaker showing elsewhere. But going below 50% on any single domain is a strong sign of a knowledge gap worth closing.

What’s the single most-tested concept across all domains? Programmatic enforcement vs. prompt-based guidance. The exam keeps coming back to this question: “When should a rule be encoded in code (a tool schema, a hook, a permission gate) versus expressed in a prompt?” The pattern: must-hold rules go in code; should-usually-hold guidance goes in prompts.

Are the domains likely to change? Anthropic could rebalance them in future versions. The CCA-F is the foundation level; advanced certifications are reportedly in development and may carve up these areas differently. The current breakdown reflects how Anthropic frames “the basics” of building production-grade Claude systems.

