I Refactored a 2,000-Line TypeScript File With GPT-5.5 in Codex: 3 Wins, 2 Losses, and the Workflow I'm Keeping

GPT-5.5 hit 82.7% on Terminal-Bench 2.0 inside Codex. Here's a real 2,000-line TypeScript refactor that put it through its paces — what worked, what failed, and the workflow that survived contact with production code.

OpenAI’s GPT-5.5 has been rolling out inside Codex for a few weeks now, and the headline benchmark — 82.7% on Terminal-Bench 2.0 — is the one part of the marketing that has actually reproduced in independent tests. That puts GPT-5.5 in the same tier as Claude Opus 4.6 and DeepSeek V4-Pro on agentic-coding tasks, which is the exact territory where most working developers live.

But benchmarks don’t refactor production code. So this week I took a 2,000-line TypeScript file from a real codebase — admittedly a mid-sized side project, not a Fortune 500 monorepo — and ran it through GPT-5.5 in Codex with a non-trivial refactor brief. The goal: split a fat-controller-style React component into a proper hook + presentational component pair, rename a confused state machine, and tighten up the types around a feature flag system.

Three things went well. Two things broke. Here’s the workflow that came out of it.

The Setup

For context, the file:

  • 2,043 lines of TypeScript / TSX
  • A single React component handling auth flows, feature flag checks, analytics emission, and three nested state machines
  • Roughly 60 distinct branches across the state-machine logic
  • 14 imported types from a shared module that had drifted out of sync with actual usage
  • About 30% test coverage on the component (paired Jest + React Testing Library)

The refactor brief I gave Codex with GPT-5.5:

“Refactor this component into a presentational React component plus a custom hook. The hook owns the auth flow, feature flag resolution, and the three state machines. The presentational component takes the hook’s return value as props and only handles rendering. Preserve all existing branches. Rewrite the state machine names to use the present-tense verb pattern (loading → loadingState, etc., where the type names are also clearer about their lifecycle phase). Tighten the types around the feature flag system to use a discriminated union instead of the current loose-record pattern. Don’t break any existing tests.”

This is a real-shape refactor, not a contrived benchmark task. Multiple concerns that have to be teased apart, type-system work that needs precision, and a hard “don’t break tests” constraint.

Win #1: The Hook Extraction Was Cleaner Than Mine Would Have Been

The first thing GPT-5.5 did, after about 90 seconds of Codex tooling around (reading the file, sampling adjacent files, checking the test suite), was produce a hook design that I genuinely would not have thought of.

Specifically: instead of one useAuthFlow hook that returned a giant object, GPT-5.5 proposed three composed hooks — useAuthState, useFeatureFlags, and useFlowMachine — and a wrapping useAuthFlow that combined them. The presentational component took five named props instead of one massive object. Each hook had its own test surface.
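For concreteness, here's a minimal sketch of that shape. The hook names are from the model's output; the state fields and return values are my own illustrative stand-ins, not the actual refactor:

import { useCallback, useState } from 'react';

// Illustrative type; the real ones live in the shared types module.
type AuthStatus = 'signedOut' | 'authenticating' | 'signedIn';

// Each concern gets its own hook, and its own test surface.
function useAuthState() {
  const [status, setStatus] = useState<AuthStatus>('signedOut');
  const signIn = useCallback(() => setStatus('authenticating'), []);
  return { status, signIn };
}

function useFeatureFlags() {
  const [flags] = useState<ReadonlyMap<string, boolean>>(new Map());
  return { flags };
}

function useFlowMachine() {
  const [step, setStep] = useState(0);
  const advance = useCallback(() => setStep((s) => s + 1), []);
  return { step, advance };
}

// The wrapper composes the three; the presentational component
// consumes its return value as a handful of named props.
export function useAuthFlow() {
  const { status, signIn } = useAuthState();
  const { flags } = useFeatureFlags();
  const { step, advance } = useFlowMachine();
  return { status, signIn, flags, step, advance };
}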

This is the kind of compositional design refactor that’s easy to plan in the abstract and hard to actually execute on existing code without breaking branches. GPT-5.5 executed it. The resulting structure was actually better than what I’d have produced in the same time, partly because GPT-5.5 has clearly seen a lot of well-factored React code in its training and reaches for compositional patterns by default.

For the architectural-decision part of refactoring, the model is now reliably better than mid-level engineering instinct. That’s a real change.

This isn’t an isolated impression. @om_patel5 on April 14 (1,247 likes, 149K views, 116 reposts) summarized 120 hours of direct comparison from a 14-year principal engineer co-developing across an 80K LOC Python/TypeScript project: “Claude feels like an engineer on a time crunch: speeds toward getting things working… leaves tasks half-done mid-migration… almost never creates new files — just bloats existing ones. Codex feels like a 5-6 year senior: stops mid-task to rethink and refactor unprompted… doesn’t extend god classes — it factors them out… does things you hadn’t thought of that are actually additive… building enterprise software? Codex wins.” The “factors out god classes vs bloats existing ones” framing is the cleanest single sentence I’ve seen on the architectural-instinct delta — and it matches what showed up on my refactor task.

Win #2: The Type-System Work Was Surgical

The second win was the discriminated union refactor on the feature flag system. The original code had something like:

type FeatureFlags = Record<string, boolean | string | number | undefined>;

Which — fine, it works, but it loses every type-safety affordance the moment you actually use it. The refactor brief was to tighten this to a discriminated union.
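To see why, here's the kind of call site the loose record permits (the flag name is hypothetical):

declare const flags: FeatureFlags;

// Compiles, and evaluates truthy even when the stored value is the
// string 'false', because nothing ties a flag name to a value type.
if (flags['newCheckout']) {
  // ...enable the new checkout path
}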

What GPT-5.5 produced:

type FeatureFlag =
  | { kind: 'boolean'; name: string; value: boolean }
  | { kind: 'string'; name: string; value: string; allowedValues?: readonly string[] }
  | { kind: 'number'; name: string; value: number; min?: number; max?: number }
  | { kind: 'percentage'; name: string; value: number }; // 0-100

type FeatureFlagMap = ReadonlyMap<string, FeatureFlag>;

Plus 14 type-narrowing helper functions and updates to all 27 call sites in the file. Every existing usage continued to compile. Every test passed. The type system now actually catches the case where someone passes a 'true' string where a boolean flag is expected — which had been a real production bug six months earlier.
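One of those narrowing helpers, sketched against the union above (the helper and call-site names are my illustration, not the model's actual output):

// Narrows a FeatureFlag to its boolean variant.
function isBooleanFlag(
  flag: FeatureFlag
): flag is Extract<FeatureFlag, { kind: 'boolean' }> {
  return flag.kind === 'boolean';
}

// At a call site, the compiler now rejects the old bug class:
// a 'true' string can no longer masquerade as an enabled flag.
function isEnabled(flags: FeatureFlagMap, name: string): boolean {
  const flag = flags.get(name);
  return flag !== undefined && isBooleanFlag(flag) && flag.value;
}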

This is the kind of work that’s tedious and error-prone for humans (you have to track 27 call sites and not miss any) and well-suited to a coding model with the patience to enumerate. Type-system migration work is now genuinely faster with GPT-5.5 than without it. Note this is the same category where Claude Opus has historically been strong; both models are solid here.

Win #3: It Asked the Right Clarifying Question

About 4 minutes in, GPT-5.5 paused with a clarifying question:

“Before I rewrite the state-machine names, I see two competing naming conventions in the existing tests: some tests reference LoadingState, others reference IsLoading. Which is the canonical pattern you want me to standardize on, or should I introduce a new pattern uniformly?”

That’s a good question. The existing inconsistency was a real source of bugs (state-name drift between component and tests). I would not have noticed it without the model surfacing it.

I told it to standardize on the LoadingState pattern uniformly and to flag any test that referenced the old IsLoading form so I could review the rewrite. It did exactly that. The output included a separate file noting which tests had been touched and why.

Asking the right clarifying question at the right time is one of the harder things to evaluate in a coding model. GPT-5.5 did it well here.

Loss #1: It Confidently Broke a Subtle Test

The first failure was real and worth flagging. There was a test asserting that a specific sequence of analytics events fires during the auth flow:

expect(mockAnalytics.track).toHaveBeenNthCalledWith(2, 'auth_step_completed', expect.any(Object));

The original code emitted the events in a specific order because of how the state machine sequenced its transitions. GPT-5.5’s refactored hook also emitted the events — but in a slightly different order, because the new compositional hook structure made the analytics emission happen synchronously where it had previously been microtask-delayed.

Functionally identical from a user’s perspective. But the test broke.
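A minimal reproduction of the shift (the second event name is hypothetical; the point is the timing, not the payload):

// Before: the old state machine deferred emission to a microtask.
function completeStepOld(track: (event: string) => void) {
  queueMicrotask(() => track('auth_step_completed'));
  track('auth_flow_advanced'); // hypothetical second event
  // order the test observes: auth_flow_advanced, then auth_step_completed
}

// After: the composed hook emits synchronously.
function completeStepNew(track: (event: string) => void) {
  track('auth_step_completed');
  track('auth_flow_advanced');
  // same two events, opposite order: toHaveBeenNthCalledWith(2, ...) breaks
}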

The model didn’t notice. It reported tests passing because Codex’s default test run uses --watch=false --silent, and the failure was on a single assertion. The model’s summary said “all 47 tests pass.” They did not. 46 passed; 1 failed silently.

I caught it only because I rerun the test suite manually after every model-driven refactor, precisely for bugs like this. The lesson: don’t trust model-reported test results. Re-run them yourself, with full output. This isn’t a GPT-5.5-specific problem — Claude has done the same thing — but it’s the most consistent failure mode I’ve observed across coding models in 2026, and GPT-5.5 didn’t fix it.

Loss #2: It Hallucinated a Library Method

The second failure was more annoying. The original code used a feature flag library — call it flagsmith-lite to keep things generic — with a specific API. GPT-5.5’s refactored code called a method on the library that doesn’t exist:

const flags = await flagsmith.evaluateBatch(flagNames, context);

There is no evaluateBatch method on this library. There’s evaluate (single flag) and a different pattern for bulk-loading at startup. GPT-5.5 invented the API based on what it thought a “reasonable” library would offer.

This is a known and quantifiable failure mode for coding models working with libraries that aren’t heavily represented in training data. Wang et al. (arXiv:2407.09726, 2024) measured it directly: GPT-4o achieves only 38.58% valid invocations on low-frequency APIs — meaning the majority of calls to less-common libraries are incorrect or hallucinated. flagsmith-lite is a moderately popular library — not obscure, but not React or Lodash either — and the model’s behavior was “confidently invent a plausible API,” which the academic measurement says is the typical case, not the exception. Tests didn’t catch this because the mocks happened to stub the (nonexistent) method without complaint, since they’d been auto-generated alongside the refactor.
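Reconstructed, the mock looked roughly like this; jest.mock never checks the factory against the real module surface:

// Auto-generated alongside the refactor. Jest happily fabricates
// a method the real module doesn't have, so the mocked call
// resolves and the hallucination survives the whole suite.
jest.mock('flagsmith-lite', () => ({
  evaluateBatch: jest.fn().mockResolvedValue([]),
}));

Typing the factory, for example with satisfies against typeof import('flagsmith-lite'), would have surfaced the phantom method at compile time.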

I caught it on the first run-through in the browser, where the actual library exception fired. This is a 60-second debug, not a serious problem — but it cost me the time, and on a less fortunate run it could have shipped to a staging environment.
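The fix was mechanical once spotted. The library is pseudonymized here, so these signatures are illustrative, but the shape was: replace the invented batch call with per-flag evaluate calls.

// Illustrative shims; the real library's types differ in detail.
declare const flagsmith: {
  evaluate(name: string, context: unknown): Promise<FeatureFlag>;
};
declare const flagNames: string[];
declare const context: unknown;

// Instead of the invented evaluateBatch: evaluate each flag and
// gather the results.
const flags = await Promise.all(
  flagNames.map((name) => flagsmith.evaluate(name, context))
);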

Lesson: any library call the model emits that you haven’t personally seen before, look up in the actual library docs. Models are not yet trustworthy on library APIs outside the top 50 libraries.

The pattern shows up at scale, too. r/OpenAI user u/exscionewhuman on April 24, working in a 300K-line repo: “Gpt 5.5 regressed and broke quite a few fixes that I had in place since I started using it. It seems to need to compact the context more, and when it does, it hallucinates and starts working on tasks that were already completed, or bringing in things I said that aren’t related anymore… So now I’m burning tokens trying to fix the issues that it caused.” On larger codebases, the failure mode shifts from “invents library APIs” to “invents prior context” — the same root cause (confident interpolation when the model lacks ground truth), different surface presentation. Same lesson either way: diff-test the output against what the codebase actually does, and don’t trust the model’s summary.

The Workflow I’m Keeping

After a few hours of this — including some additional runs on smaller refactors to triangulate — the workflow that emerged:

Step 1: Architecture brief, executed by the model. Let GPT-5.5 (or Claude Opus, or DeepSeek V4-Pro — all of them are good here) handle the compositional refactor decisions. They reach for cleaner patterns than mid-level engineering instinct and they’re patient about enumeration. This is the highest-leverage part of the workflow.

Step 2: Clarifying questions, answered by me. When the model asks a clarifying question, take it seriously and answer specifically. A 30-second clarification costs orders of magnitude less than redoing a refactor on a wrong assumption.

Step 3: Re-run tests with --verbose --no-watch myself. Do not trust the model’s report. Run npm test -- --verbose (or your equivalent) and read the output line by line. Flagged failures are the actual contract.

Step 4: Cross-reference any library API the model calls that I don’t personally recognize. Open the library docs. Search the symbol. If it doesn’t exist, the model invented it. Fix.

Step 5: One human pass through the diff before commit. Not a code review of every line — a structural read for: did the model preserve the branch logic? Are there any obvious “this looks too clean to be right” sections that warrant double-checking?

Step 6: For multi-file refactors, run on a feature branch, push, let CI run. Don’t trust local test runs as the final authority on cross-module changes.

That’s it. Six steps. None of them are exotic. All of them save real time versus the naive “let the model do it and trust the output.”

How GPT-5.5 Compares Right Now

Honest take, since I’m running multiple coding models in 2026:

GPT-5.5 in Codex — strongest on architecture/composition decisions, very good on type-system work, occasional library hallucinations, asks good clarifying questions. Best for: structured refactors, type migrations, multi-step compositional design.

Claude Opus 4.6/4.7 (or DeepSeek V4-Pro through its Anthropic-compatible endpoint) — strongest on long-context coherence (1M context wins for monorepo work), patient on tedious enumeration, slightly better library-API accuracy in my testing. Subject to the silent-degradation issues that triggered the April 23 Anthropic Claude Code postmortem — I now run the reliability audit framework on Claude weekly. Best for: large-context refactors, where the file count exceeds what fits in GPT-5.5’s working window.

DeepSeek V4-Pro — at 1/7th the price, surprisingly close to Opus on agentic coding. The best value-for-money option right now. Hallucinates more on niche libraries; otherwise comparable. Best for: cost-sensitive work, exploration, second-engine fallback when Anthropic’s having a bad week.

For my actual day-to-day, I run GPT-5.5 in Codex for new feature work, Claude (audited weekly) for long-context refactors, and DeepSeek V4-Pro as the second engine I can swap to in 30 seconds when needed. Three engines, swap based on workload, audit each one’s reliability separately.

What This Means If You’re Evaluating GPT-5.5

If you’re a working dev evaluating whether to bring GPT-5.5 into your refactor workflow:

  1. The Terminal-Bench 2.0 number (82.7%) is real, in my experience. Marketing isn’t overstating it. @HackingDave ran GPT-5.5 in Codex against Opus 4.7 in Cursor on real work and reported “GPT 5.5 producing about 8-13% better code quality, 8-12% less bugs, and 27% more thoroughness on implementation features but GPT 5.5 is about 20-23% slower… insanely better at evidence based bug hunting and root cause analysis.” Those numbers match my own findings within rounding error.
  2. The library-hallucination and silent-test-failure issues are real and they’re unchanged from earlier models. Don’t expect GPT-5.5 to have fixed them.
  3. Architecture-and-composition tasks are now reliably better than mid-level human instinct. This is genuinely new and it’s the highest-leverage thing.
  4. The Codex workflow specifically is well-tuned now. Tooling around the model — file reads, test runs, diff display — is materially better than 6 months ago.

Run a refactor through it this week. Pick a real file. Apply the 6-step workflow above. You’ll know within 90 minutes whether GPT-5.5 belongs in your stack permanently or whether your existing tools (Claude, DeepSeek, Cursor, whatever) are still the right primary.

For deeper grounding on what GPT-5.5 actually changes for everyday ChatGPT users (the non-coding side), our GPT-5.4 for ChatGPT Users course covers the foundations and now includes the GPT-5.5 update notes. For the broader “which coding engine when” decision framework, the Claude Code Reliability Audit course goes deep on the multi-engine ops pattern this workflow is part of.

The summary: GPT-5.5 in Codex is real progress. Not magic, not mid-2024 hype. Real progress. The workflow above is what makes it useful.
