
Rewrite vs refactor: the decision tree we actually use

Every engineering team has the rewrite conversation eventually. Most of them get it wrong in one of three predictable ways. Here's how we run the call.

By Coplango Engineering · 8 min read

Most engineering teams have a rewrite conversation at least once a year. Some have it every quarter. The conversation usually starts the same way: a senior engineer gets frustrated with a piece of legacy code, draws a box on a whiteboard, and says "we should just rewrite this". A product manager asks "how long?" The engineer says "three months". Eighteen months later the team is still not done, the old system is still running alongside the new one, and nobody wants to be in a room together to discuss priorities.

Rewrites are one of the most expensive decisions an engineering team can make, and they're one of the few decisions where the conventional wisdom ("never rewrite" vs "rewrite boldly") doesn't help, because both are right some of the time. This post is the decision tree we work through with clients — and with ourselves, on our own products — to keep those conversations productive.

The three failure modes

Before the tree, you need to recognize the three predictable ways rewrite decisions go wrong.

Failure mode 1: "The new stack is better" (the aesthetic rewrite)

The engineer championing the rewrite can describe, in detail, why the new stack is more elegant. They cannot describe, in specific business terms, what changes for a user or a customer when the rewrite is done.

This is the most common one. It's not necessarily wrong — sometimes the stack really is bad and needs to go — but if the case for the rewrite is entirely technical, the case won't survive contact with a budget review six months in. Something will come along that's more urgent, and the rewrite will be paused with 40% of the work done, which is the worst possible state to be in.

Failure mode 2: "We'll do it alongside the old one" (the parallel rewrite)

The plan is to build the new system in parallel, feature by feature, and cut over gradually. This sounds responsible and is almost always a disaster. Here's why: the team is now maintaining two systems, the old one keeps accumulating features (because the business won't stop), the new one has to race to catch up, and by month nine the new system is still missing features and the old system has moved further ahead.

Parallel rewrites work under exactly one condition: the old system is genuinely frozen and the business has the discipline to say no to new features on it. Without that discipline, the parallel rewrite becomes the shape most failed rewrite projects take.

Failure mode 3: "It's just a refactor" (the stealth rewrite)

The team doesn't get permission for a rewrite, so they plan a "large refactor" that is, in practice, a rewrite touching 80% of the codebase. This is the version that hurts the most because nobody — including the team doing it — has internalized that they're in a rewrite project with rewrite-level risk. There's no cutover plan, no parallel-running strategy, no rollback. One day a PR lands that is technically a refactor but semantically a rewrite, and things break in production in ways the existing test suite doesn't catch.

If your "refactor" touches most of the codebase, it is a rewrite. Name it. Plan it. Don't smuggle it in.

The decision tree

Here's how we actually run the call. Four questions, in order. If the answer to any of them is "no" or "we don't know", you don't have a rewrite candidate yet — you have a scoping problem.

Question 1: Is the pain measurable and chronic?

Not "this code is ugly". Not "this framework is deprecated". Specifically: is there a business number that's moving in the wrong direction because of this system?

Examples of yes:

  • Time-to-ship for a typical feature has grown from 3 days to 3 weeks over the last year, and you can trace it to a specific subsystem.
  • On-call pages from this system are 40% of total pager load.
  • Onboarding a new engineer to this codebase takes 4 weeks longer than onboarding them to any other system.
  • You've lost two hires who said the codebase was a factor in their decision to leave.

Examples of no:

  • The code is hard to read.
  • The tests are flaky.
  • It uses an old library.

If the pain is chronic and measurable, you can write the business case. If it isn't, a refactor is what you want — not a rewrite.

Question 2: Can you freeze the old system for the duration?

By "freeze" we mean: the business is willing to commit, in writing, that no new features land on the old system until the new one is at parity and the cutover is done. Bug fixes yes. Security patches yes. New features no.

If the answer is "well, we'd try, but we can't make promises" — that's a no. Go back and have the conversation with product leadership until you get a real yes or a real no. A soft maybe is a disaster.

If you can't freeze, you have two options: either (a) find a bounded slice of the system that can be frozen and rewrite just that, or (b) do a structured refactor instead, which can proceed in parallel with ongoing feature work because it preserves public behavior.

Question 3: Do you have a cutover plan that's safe to execute incrementally?

A rewrite needs a plan that lets you flip a switch on a fraction of traffic at a time, verify correctness, and roll back instantly if something goes wrong. Not "we'll spend a weekend doing a big-bang migration". Big-bang migrations are where rewrites go to die.

The shape we look for:

  • Shadow traffic: the new system receives real production traffic in parallel with the old, but its outputs are logged rather than acted on. You compare old vs new output for a week or two.
  • Canary: 1% of real users get the new system's output. You measure error rates and business metrics. If they move, you roll back.
  • Ramp: 1% → 5% → 25% → 100% over a period long enough to catch edge cases.
  • Fallback: the old system is still running and can be re-enabled by a config flip for at least 90 days after 100% cutover.

If your rewrite target doesn't support this kind of incremental cutover — maybe because it's the database layer, or because outputs aren't easily diff-able — the risk profile goes up by an order of magnitude. That doesn't mean don't rewrite it. It means the success criteria and the review gates need to be much stricter.
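The cutover shape above can be sketched as a small router in front of both systems. This is a minimal illustration, not a production implementation: the class, the config fields, and the logging are all hypothetical, and a real deployment would hang the percentage and flags off a live config service rather than object attributes.

```python
import hashlib

class CutoverRouter:
    """Routes each request to the old or new system according to the
    cutover stage: shadow, canary/ramp percentage, or fallback."""

    def __init__(self, old, new, percent=0, shadow=False, fallback=False):
        self.old, self.new = old, new
        self.percent = percent    # canary/ramp: % of users on the new system
        self.shadow = shadow      # shadow mode: run new, log diffs, act on old
        self.fallback = fallback  # config flip: force everything back to old

    def _bucket(self, user_id: str) -> int:
        # Stable hash so a given user stays on the same side of the
        # split for the whole ramp (0..99).
        digest = hashlib.sha256(user_id.encode()).digest()
        return digest[0] * 100 // 256

    def handle(self, user_id: str, request):
        if self.fallback:
            return self.old(request)
        if self.shadow:
            old_out = self.old(request)
            new_out = self.new(request)
            if new_out != old_out:
                print(f"DIFF user={user_id}: old={old_out!r} new={new_out!r}")
            return old_out  # new output is logged, never acted on
        if self._bucket(user_id) < self.percent:
            return self.new(request)
        return self.old(request)
```

Ramping is then a config change, not a deploy: raise `percent` through 1 → 5 → 25 → 100, and keep `fallback` available as the instant rollback for the 90-day window.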

Question 4: Do you have a clear definition of done?

"Feature parity with the old system" is not a definition of done. Nobody knows what's in the old system. Nobody has a complete list. The old system has accumulated features that are used by two customers and have no documentation and break in weird ways.

What we require:

  • A list of user-visible features, with owners, signed off by product.
  • A traffic report showing which features are used and by how many customers.
  • An explicit decision, per feature in the bottom quartile of usage, whether it comes with you to the new system or gets dropped. ("We'll figure it out later" is not a valid answer.)
  • A success criterion tied to the measurable business pain from Question 1. The rewrite isn't done when the code ships. It's done when the business number you set out to move has moved.
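The bottom-quartile decision above is mechanical once you have the traffic report. A sketch, assuming a per-feature count of distinct customers (the feature names and numbers here are invented):

```python
from statistics import quantiles

# Hypothetical traffic report: feature -> distinct customers in the
# last 90 days.
usage = {
    "export_csv": 4120, "bulk_edit": 2210, "webhooks": 870,
    "legacy_api_v1": 35, "xml_import": 12, "fax_gateway": 2,
}

q1 = quantiles(usage.values(), n=4)[0]  # 25th percentile of usage
needs_decision = sorted(f for f, n in usage.items() if n <= q1)
print(needs_decision)  # each of these gets an explicit keep-or-drop call
```

The point isn't the statistics; it's that every feature on that list gets a named decision before the rewrite starts, not after.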

If you can't define "done", you will never ship. The project will drift for a year, get deprioritized, and become the next team's legacy system.

What you should do instead, most of the time

Here's the uncomfortable part. When we run a team through this decision tree, most proposed rewrites don't survive. The answer most of the time is: you have a refactoring problem, not a rewrite problem.

What that looks like in practice:

  • Identify the 20% of the codebase that causes 80% of the pain (usually there are one or two modules that everyone dreads touching).
  • Put a boundary around them — an interface, a facade, a clear input/output contract.
  • Refactor the internals behind the boundary without changing the contract.
  • Repeat, one bounded module at a time, alongside normal feature work.
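A minimal sketch of what the boundary looks like in code, assuming a hypothetical pricing module (the facade name, the `quote` contract, and the legacy internals are all invented for illustration):

```python
class PricingFacade:
    """The contract callers depend on. Once every call site goes
    through this interface, the internals behind it can be refactored
    or replaced without touching callers."""

    def quote(self, sku: str, quantity: int) -> float:
        # For now, delegate to the legacy internals unchanged.
        return _legacy_quote(sku, quantity)

def _legacy_quote(sku: str, quantity: int) -> float:
    # The dreaded module. Contract tests pin quote()'s observable
    # behavior, so this can be rewritten one piece at a time while
    # the contract stays fixed.
    base = {"WIDGET": 9.99}.get(sku, 0.0)
    return base * quantity
```

The contract tests against `quote` are the safety net: they assert on inputs and outputs only, so they survive every internal refactor and fail the moment the boundary's behavior drifts.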

This is less exciting than a rewrite. It doesn't give the team the emotional catharsis of burning the old code. It doesn't come with a shiny new stack. But it ships, it's safe, and it doesn't require the business to freeze feature work. Ninety percent of the time it's the right call.

In the ten percent of cases where a real rewrite is the right call, all four answers are yes: the pain is chronic and measurable, the business can freeze, the cutover plan is safe, and the definition of done is sharp. That ten percent is where rewrites succeed. Outside of it, they become cautionary tales.

The short version

  • If the pain isn't measurable, you have a refactoring problem.
  • If the business can't freeze, you have a refactoring problem.
  • If you can't cut over incrementally, you have a risk problem to solve first.
  • If you can't define done, you have a scoping problem.
  • If you answer yes to all four, you probably do have a rewrite on your hands — and now you have a business case that will survive a budget review.

If you're in the middle of one of these conversations and you'd like someone outside the organization to help you run the call, get in touch. An outside perspective is often the thing that keeps a rewrite debate productive instead of political.
