An SDLC for Code Generation
The call came late on a Tuesday. A founder I’d met once, briefly, at a conference. His voice had that particular flatness you hear from people who have stopped sleeping properly.
“We shipped it,” he said. “It’s in production. And nobody on the team can figure out how it actually works anymore.”
His company had built their SaaS platform in six weeks using AI. Six weeks — a genuinely remarkable number, the kind of number that gets you on a podcast. The problem was that they were now four months into “fixing the last few things,” and the fixes were generating new bugs faster than they were closing old ones. The code was fluent, well-commented, and thoroughly incomprehensible to every human who touched it. Their best engineer had quit the week before. The investors were starting to ask questions.
I’ve had variations of this conversation dozens of times in the last two years. Different founders, different stacks, same story. Somebody sold them the dream that AI would write the code — real code, production code — and that the old, boring discipline of software engineering could finally be retired. For a few weeks, it looked like they were right.
Then reality arrived.
The Fantasy That Keeps Failing
A beautiful dream took hold in boardrooms a couple of years ago. AI was going to write the software. Spreadsheets got updated, headcount projections got rewritten, and a particular strain of executive — the one who had always suspected that developers were a load-bearing nuisance rather than a strategic asset — began to smile in meetings.
I watched one company lay off most of their engineering team in a single afternoon. Eight months later they were hiring again, at higher salaries, under the title “AI Engineer.” Most of the new hires were the same people, now with a shinier LinkedIn headline and significantly less loyalty.
Here is what takes these projects down, and it isn’t AI.
Traditional software development has an SDLC — requirements, design, implementation, testing, deployment, maintenance — for a reason. It’s the accumulated wisdom of decades of shipping things and watching them break. When teams bolt AI code generation onto that SDLC unchanged, they quickly discover that the old process was tuned for the rhythms of human developers, not for a system that produces code at a hundred times the speed and fails in entirely unfamiliar ways.
And on the other side, the “vibe coders” — the ones who threw the SDLC out entirely, opened a chat window, and started shipping. For about three weeks they looked like geniuses. Then production happened. It turns out that typing prompts very fast is not, in fact, the same thing as engineering.
Both camps are wrong for the same reason. They never asked the one question that matters: what actually changed? What changed is not the discipline of engineering. What changed is who does which parts of it. Until you work that out honestly, no process — old or new — will save you.
Know Thy Collaborator
Almost every failure I’ve seen traces back to a team that didn’t understand the respective strengths of the two parties at the keyboard. So let me be blunt.
Humans are better at lateral thinking — the messy, analogical “what if we did this completely differently” reasoning that produces genuinely new approaches. Humans are also better at decomposition: looking at a tangled business problem and slicing it into components that make sense for this product, with these constraints, these users.
AI is better at almost everything that comes after. It is phenomenally good at recognizing which well-known architectural pattern fits a requirement and implementing it cleanly. It is excellent at the plumbing — naming variables, handling errors, structuring files, keeping current with library versions. And it is, crucially, very good at evaluating human ideas. We fall in love with our own cleverness and miss the obvious holes. An LLM will find them in thirty seconds if you give it permission to.
Now the weaknesses. AI needs crisp prompts and accurate specifications. And humans — this is the part nobody wants to admit — are catastrophically bad at writing accurate specifications. That is the central deadlock of modern software, and it has been for forty years. We always assumed programmers would bridge the gap with judgment. Take the programmers out, hand the spec directly to a generator, and every ambiguity in the spec becomes a subtle bug in the product.
I worked with a team that spent eleven weeks writing what they called “the master specification.” A two-hundred-page document covering every feature, every edge case, every acceptance criterion. They fed it to the model in a single heroic prompt and generated the application in an afternoon. The system technically matched every bullet point in that document. It was also completely unusable, because the LLM was overwhelmed by the specification and generated something far below what anyone would expect. Nobody had actually thought about how the pieces fit together — only whether each piece was listed. The model produced exactly what was asked for, which was a list of features in a trench coat.
There is a better way, and it doesn’t involve a two-hundred-page document.
Reasonable Expectations: Stop Setting Your Team On Fire
One more thing before we get into the workflow.
If this is your first serious project with generative AI, it is going to take longer than the demo suggested. Not as long as building it by hand — nothing like that long — but longer than the CEO’s LinkedIn post implied. Unreasonable expectations don’t make humans faster. They make humans panic. And panicked humans make terrible decisions about AI-generated code, which is already a domain where calm is the main professional skill.
Budget for a learning curve. Expect your first project to teach you how to do the second one properly. That is not a failure mode; that is how unfamiliar tools work, and anyone who tells you otherwise is selling something.
Also: learn the models. Opus, Sonnet, and Haiku have genuinely different personalities and are good at different things. Using Opus for every task is like using a Formula 1 car to pick up groceries. Using Haiku for deep architectural reasoning is like asking an intern to design your payment system. Match the model to the job. The money you save by doing this properly will pay for the discipline that makes the project succeed.
A Working SDLC for Generated Code
Here is the process I’ve settled on after enough projects to know what works. It is deliberately light on ceremony and heavy on judgment, because the point of a good SDLC for generated code isn’t to document everything — it is to put the right cognitive work at the right step.
Step 1: Chat. Actually Chat.
Don’t open a prompt-engineering blog post. Don’t craft the Perfect Prompt. Open claude.ai, select Opus from the model dropdown, and have a conversation the way you would with a sharp colleague who just joined the team.
Tell it what you’re trying to build. Tell it what you’re unsure about. Ask what it would do differently. Push back on the parts you disagree with. This is not a prompt; it is a conversation. Most people skip this step because it feels unproductive — no code is being generated, nothing is being shipped, nothing on the ticket board is moving. That is precisely why it is the most valuable hour of the project.
The founder from the opening of this piece skipped this step. He told me later that he wrote his first prompt in about twenty minutes, felt very clever about it, and handed it to an agentic coding tool. The tool did exactly what he asked. He never stopped to ask himself whether what he asked was right.
At the end of the conversation, ask Claude to produce a short concept.md capturing the core idea. Emphasize short. If it hands you an essay, push back and demand brevity. You want the heart of the product on one page, not a treatise.
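To make the scale concrete: the concept.md that comes out of such a conversation can be as short as a handful of labeled lines. The product and every detail below are invented purely for illustration:

concept.md (illustrative)
Problem: small agencies lose hours every week chasing overdue invoices by hand.
Product: a web app that watches the accounting system, drafts reminder emails, and escalates on a schedule the owner controls.
Users: agency owners and their bookkeepers; no technical skills assumed.
Out of scope for v1: payments, multi-currency, mobile apps.
Open questions: which accounting integrations matter first; how much control users need over tone.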
Step 2: Architecture Is Still a Human Job
Every product needs its own architecture, data flow, storage, and access patterns. AI has a decent grasp of these in the abstract, but the specific shape your system should take is judgment work — yours, or someone you pay who has that judgment. I cannot overstate how often this is the step that separates the successful projects from the disasters.
Write the core architectural decisions on a notepad. Not a Confluence page. Not a hundred-page doc. Three paragraphs. The core ideas — nothing more.
Then look at the product and ask: how much of this is standard? If you’re building an e-commerce portal, an auth system, a CRUD admin panel — that’s solved territory. AI has seen ten thousand of each, and it will build them faster and cleaner than any human. The value you add is in the unusual parts, the parts that make your product actually yours. Decompose aggressively: isolate the standard components (where AI will fly) from the unique ones (where you need to be involved).
Now — and this matters — start a fresh chat. Long chats are tempting but they accumulate noise, waste tokens, and drift. I once watched a developer struggle for two hours because the model kept generating MongoDB schemas when he wanted Postgres. He couldn’t figure out why. Eventually we scrolled back through the chat and found it: two hundred messages earlier, during exploration, he had mentioned MongoDB in passing. The model had filed that away as context and was gamely trying to be consistent with it. Fresh chat. Clean context. Every time.
Drop in your concept.md, share your architectural thinking, and ask for its opinion. Ask about pitfalls. Ask about alternatives. Make it argue with you. This is where you extract the most value from the model — not in generation, but in critique.
End with an architecture.md that captures the decisions: components, data flow, language, cloud, database, scalability approach, security and privacy considerations. No file names. No variable names. No table schemas. Architecture only. Say that explicitly when you ask for it, because the model will happily volunteer details you don’t want yet.
Step 3: From Ideation to Implementation
Now switch gears. Open your IDE — I use VS Code with the Claude plugin — and drop architecture.md into the project.
In the terminal, ask Claude (Sonnet is the right model here — the workhorse for detailed generation) to read architecture.md and write a detailed prompt file for the first component. Remind it to consider how this component interfaces with the others. This is the prompt where specifics matter — data structures, variable names, error handling, the full set of concerns.
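The shape of the prompt file matters more than its exact wording. For the invented sync worker from earlier, a sketch might look like this:

prompt-sync-worker.md (illustrative)
Build the accounting-sync worker described in architecture.md.
- Poll the accounting API on a fixed interval and upsert invoices into the invoices table, keyed by external_id.
- On API failure, retry with exponential backoff; never write a partial batch.
- Expose a single entry point, syncOnce(), so the scheduler and the tests can both drive it directly.
- Interfaces: reads configuration from environment variables; writes only to the invoices table; emits a health metric the API can surface.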
Work through the components in order, either from source-of-data outward or from output backward. Don’t zigzag. For each component: ask for the prompt, review it, make sure it is consistent with the other prompts, and update as needed.
Here is where restraint pays off. Don’t waste time on things AI does better than you. Don’t lecture it about error handling. Don’t debate naming conventions. Don’t micromanage syntax or library versions. Focus your energy on what is unique to this product — the parts where your judgment actually matters. The rest is plumbing, and it is plumbing the model has installed a thousand times.
One critical discipline: reset the session between components. A new component is a new context. Dragging in the baggage of previous conversations wastes tokens and muddies the model’s focus. When all the component prompts exist, do one final pass across them to ensure coherence, then generate a setup prompt for the overall project — explicitly asking for current library versions.
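The setup prompt can be short; the part that earns its keep is the explicit instruction about versions. Something along these lines, with every specific a placeholder:

prompt-setup.md (illustrative)
Create the project skeleton for the components described in architecture.md.
- Initialize a TypeScript project with one package per component and a shared package for common types.
- Look up the current stable versions of every dependency before writing package.json; do not rely on remembered version numbers.
- Generate build, test, and lint scripts, plus a README describing how to run each component locally.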
Step 4: Implementation — Bring in the Cheaper Model
Switch to Haiku. Yes, seriously. For well-scoped implementation prompts, Haiku is fast, cheap, and perfectly capable — and the money you save here funds the more expensive review passes later.
Feed it one prompt file at a time. Let it implement, then ask it to review its own work and add thorough comments. Reset the session between prompts. When every component is done, ask Haiku to review the full codebase and flag missing pieces.
Then — and this is important — reset again and ask Opus to do a senior review. Does the code do what the comments claim? Does it match the prompts? Are there gaps between components? Opus is pricier, but it catches the class of problems cheaper models miss, and it catches them before they become production incidents.
This two-tier review pattern — cheap model generates and self-reviews, expensive model audits — is one of the highest-leverage habits you can build. I inherited a project once where every module had a clean, reassuring comment header describing what it did. It took me two days to realize the comments were wrong. They had been generated in a separate pass, after the code, and the model had politely hallucinated what the code “probably” did. The actual behavior diverged in at least a dozen places. Opus, given the chance, would have caught that in a single review pass. Nobody had given it the chance.
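The audit prompt itself does not need to be elaborate. Something in this spirit usually does the job: “Review this codebase against the component prompt files. For each component, answer three questions: does the code do what its comments claim, does it satisfy its prompt, and does anything fall through the gaps between components? List concrete discrepancies with file references. Do not fix anything yet.”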
Step 5: Testing — But Not the Way You Think
Have Sonnet generate test cases: unit tests for each component, and integration tests for the product as a whole. Don’t rush to automate. Sanity tests first. Automation later, once you know what stable actually looks like.
One warning about testing generated code, because I have seen this destroy a project. If the same model writes the implementation and the tests in the same session, you have tested what the model did, not what the product should do. That is the AI equivalent of a developer grading their own homework. I once audited a codebase that proudly reported 100% test coverage. Roughly 40% of the features did not actually work, because every test had been written to agree with the implementation. The tests and the code shared the same hallucinations.
Separate the concerns. Derive tests from the architecture and the requirements, not from the generated code. Property-based tests and fuzzing earn their keep here because they don’t share the model’s blind spots.
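Here is what deriving a test from the requirements rather than the code can look like. A minimal sketch, assuming a TypeScript stack and the fast-check property-testing library; computeLateFee and the fee rule are hypothetical stand-ins for whatever your spec actually says:

// The spec, not the generated code, says: no fee in the first week; after that
// the fee is never negative and never exceeds 20% of the invoice amount.
import fc from "fast-check";
import { computeLateFee } from "./fees"; // hypothetical module under test

fc.assert(
  fc.property(
    fc.integer({ min: 1, max: 1_000_000 }), // invoice amount in cents
    fc.integer({ min: 0, max: 365 }),       // days overdue
    (amount, daysOverdue) => {
      const fee = computeLateFee(amount, daysOverdue);
      if (daysOverdue < 7) return fee === 0;    // property taken from the spec
      return fee >= 0 && fee <= 0.2 * amount;   // bounds taken from the spec
    }
  )
);

Because the properties come from the requirements, these tests can fail even when the implementation and its unit tests agree with each other, which is exactly the failure mode described above.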
Step 6: The Humbling Phase
If you have followed the process, you can reach this point in under a day. Now try to actually compile, build, and run the thing.
Here comes the humbling part. There will be errors. Compile errors, build errors, functional errors, integration errors — the whole menagerie. Haiku can handle most of the compile and build issues. Sonnet works through the functional problems. Let it run its own sanity tests and watch the components light up one by one, then verify they actually work together.
If the first five steps took eight hours, step six can take another eight days. That is not a flaw in the process. That is the cost of honest engineering, and it is still dramatically less than building the whole thing by hand.
The Things Nobody Tells You
A few cross-cutting lessons that don’t fit neatly into any single step but will determine whether the project thrives or quietly rots.
Prompts are source code. Version them. Review them. Commit them. When you regenerate a component six months from now with a different model, you’ll want to know exactly what instructions produced the original behavior. Teams that treat prompts as throwaway scratch notes are building on sand, and I say this as someone who has had to reverse-engineer more than one such project.
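In practice this is nothing fancier than a prompts directory that lives in the same repository as the code it produced. One possible layout, using the file names from the steps above:

prompts/
  concept.md
  architecture.md
  setup.md
  components/
    sync-worker.md
    scheduler.md
    api.md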
Generated code still needs human-grade review. In fact it needs more care, not less, because fluent code invites skimming. It looks clean. It reads well. It passes the happy path. Reviewers should specifically hunt for things LLMs reliably miss — security edge cases, performance under load, race conditions, unstated invariants. Scale review depth to risk, not to diff size.
Watch the supply chain. LLMs will confidently recommend packages that don’t exist, are deprecated, or — most alarmingly — have been typosquatted by people who noticed the hallucination pattern and registered the names. A team I was advising ran npm install on a recommended package that turned out to be a credential-harvesting imposter of a real library. We caught it before it hit production. Not everyone does. Pin dependencies. Use allow-lists. Audit what comes in.
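The mechanics are unglamorous. For an npm project the baseline looks something like this; the package name is a placeholder:

npm view some-recommended-package versions maintainers   # confirm it exists and is maintained
npm install --save-exact some-recommended-package        # pin the exact version
npm ci                                                    # in CI, install only from the lockfile
npm audit                                                 # check what actually came in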
Build an eval suite early. Models change. Prompts drift. Team habits evolve. Without a suite of evaluations that tell you “yes, the system still behaves correctly,” you are flying blind. This is the single most under-invested area in AI-assisted development, and the teams that figure it out early look like wizards in year two.
Maintenance is where the bodies are buried. The real question isn’t “can we generate this?” It is “can we maintain this in a year, when the person who generated it has moved on and the model’s defaults have shifted twice?” If no human on your team can explain what a module does and why, you don’t have a codebase. You have an archaeology site. And the archaeologists charge by the hour.
Taming the Tool: Claude Code Hacks That Actually Matter
Everything above is about process. But process without tooling is a motivational poster. If you’re using Claude Code — and you should be — there are a handful of configuration tricks that will save you more time than any prompting technique.
Auto-Format on Every Write
This one sounds trivial. It isn’t. It saved a project for me.
Claude has opinions about formatting. Prettier has different opinions. And your team has a third set of opinions codified in a .prettierrc file. When Claude writes a file, it formats the code its way. The moment you save that file, Prettier reformats it. Now your diff shows forty-three changed lines, thirty-nine of which are whitespace. Try reviewing that. Try spotting the one-line logic error hiding in the noise.
The fix is a PostToolUse hook that runs Prettier automatically every time Claude writes or edits a file. Add this to your .claude/settings.json:
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit|MultiEdit",
        "hooks": [
          {
            "type": "command",
            "command": "npx prettier --write \"$CLAUDE_TOOL_INPUT_FILE_PATH\""
          }
        ]
      }
    ]
  }
}
Claude never sees the reformatting. It doesn’t waste tokens arguing about semicolons. Your diffs are clean, reviewable, and reflect only the actual changes. This single hook eliminates an entire category of confusion that I’ve watched teams lose hours to.
Let CLAUDE.md Be the Memory You Don’t Have
Claude Code has no memory between sessions. Every time you start a new chat — and as we discussed, you should be doing that often — it walks into the room not knowing where the light switches are. CLAUDE.md is how you fix that.
Once your code is generated and stable, ask Sonnet to produce a CLAUDE.md for the project. Not a novel — a concise onboarding brief. The tech stack, the project structure, how to build and test, the conventions you chose, and the architectural decisions that aren’t obvious from the code alone. Think of it as the note you’d leave for a sharp contractor starting on Monday.
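For the invented project from earlier, the whole file might run a dozen lines:

CLAUDE.md (illustrative)
Stack: TypeScript, Node 20, Postgres.
Layout: src/api, src/scheduler, src/sync-worker; tests mirror src.
Build and test: npm run build, npm test; integration tests need a local Postgres.
Conventions: no default exports; all database access goes through src/db; errors are returned, not thrown, at service boundaries.
Non-obvious decisions: the scheduler is deliberately single-instance for now; see architecture.md for the reasoning.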
But here’s the trick most people miss: your CLAUDE.md goes stale the moment you change something and forget to update it. So don’t rely on yourself to remember. Use a SessionEnd hook to prompt an update automatically:
{
  "hooks": {
    "SessionEnd": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "echo '{\"additionalContext\": \"Review CLAUDE.md against changes made this session. Update if any architectural decisions, conventions, or key commands have changed.\"}'"
          }
        ]
      }
    ]
  }
}
Now, every time a session ends, there’s a nudge to reconcile the documentation with reality. The teams that maintain their CLAUDE.md religiously are the ones whose second and third sprints run faster than the first. The ones who don’t are starting from scratch every Monday.
One more thing about CLAUDE.md: keep it short. I’ve seen teams write thousand-line configuration files stuffed with every rule they could think of, and then wonder why Claude ignores half of them. Research suggests frontier models can reliably attend to about 150–200 instructions. After that, important rules get lost in the noise. If Claude already does something correctly without being told, delete that instruction. If a rule can be enforced by a hook or a linter, enforce it there instead of hoping the model remembers.
Block the Catastrophes Before They Happen
A PreToolUse hook can intercept dangerous commands before Claude executes them. This costs you thirty seconds to configure and saves you the one time Claude decides to rm -rf your build directory or overwrite your .env file:
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "echo \"$CLAUDE_TOOL_INPUT\" | grep -qE 'rm -rf|DROP TABLE|>.env' && exit 2 || exit 0"
          }
        ]
      }
    ]
  }
}
Exit code 2 blocks the action and tells Claude why. It’s a seatbelt. You don’t notice it until the one time it saves you.
Auto-Run Tests Before Claude Declares Victory
Claude will tell you it’s done. It will sound confident. It will be wrong. A Stop hook that runs your test suite before the agent finishes means Claude can’t walk away from a broken build:
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "npm test 2>&1 || exit 2"
          }
        ]
      }
    ]
  }
}
If tests fail, exit code 2 sends Claude back to fix the problem. This alone eliminates the single most common failure mode I see in generated codebases: code that the model considered complete but that doesn’t actually pass its own tests.
Inject Context at Session Start
Remember the MongoDB-versus-Postgres disaster? A SessionStart hook can automatically inject the current branch name, recent git history, or a summary of the project state — so every fresh session starts with orientation instead of amnesia:
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "echo '{\"additionalContext\": \"Branch: '$(git branch --show-current)'. Last 3 commits: '$(git log --oneline -3 | tr '\\n' '; ')'\"}'"
          }
        ]
      }
    ]
  }
}
Use /compact Before It Uses You
Claude Code has a context window. Long sessions fill it. When it overflows, the system auto-compacts — summarizing the conversation to free space. The problem is that auto-compaction doesn’t know which details matter to you. I’ve watched critical architectural decisions evaporate because the compactor decided they were less important than a long debug trace.
The fix: compact manually and tell it what to preserve. Run
/compact Focus on the API contract changes and the database migration decisions
before the window fills, and you control what survives. Better yet, if a task is getting long, finish it, commit, and start a fresh session. A clean context is almost always better than a compressed one.
Git Worktrees for Parallel Claude Sessions
This one is for the ambitious. If you’re working on multiple components simultaneously — which, with the process described above, you should be — use git worktrees to give each Claude Code session its own working directory. Each session gets its own branch, its own files, its own context. No cross-contamination, no merge conflicts mid-session, and you can run three implementations in parallel without them stepping on each other.
git worktree add ../project-auth feature/auth-component
git worktree add ../project-payments feature/payments-component
Open a separate terminal for each worktree, run Claude Code in each, and let them work independently. Merge when each component is stable. This is the closest thing to a genuine productivity multiplier I’ve found.
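When a component stabilizes, fold it back in and retire its worktree:

git merge feature/auth-component        # from the main checkout, once it is stable
git worktree remove ../project-auth
git branch -d feature/auth-component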
The Meta-Hack: Treat Configuration as Code
Every hook, every CLAUDE.md, every slash command — commit them. Version them. Review them in PRs just like you would application code. When a new team member joins, they clone the repo and inherit your entire Claude Code setup. When something breaks, you can bisect the configuration just like you’d bisect the code.
I’ve seen teams where every developer has their own private set of prompts and hooks, none of which are shared, none of which are versioned. They’re reinventing the wheel in every session and wondering why the output quality varies from person to person. Your Claude Code configuration is part of your engineering infrastructure. Treat it that way.
What I’ve Come To Believe
I started this piece with a founder who called me late on a Tuesday night. I’ll tell you how his story ended.
We stopped trying to fix the system. We threw it out. Then we spent three days doing step one — just talking, mapping what the product actually needed to be, building a concept document on a single page. Four more days on architecture. A week turning components into prompt files. The cheap model did the implementation, the expensive one audited, and we spent about ten days wrestling the thing into a working state.
Three weeks, start to finish. A cleaner product than the one they had spent six weeks building and four months failing to fix. The founder sleeps now. His engineers are no longer quitting.
This is the pattern. It is not magic and it is not complicated. It just requires the one thing most teams never do, which is to stop and think before they start generating.
Generative AI is not a replacement for engineering judgment. It is a force multiplier for teams that have it, and an expensive way to accumulate technical debt for teams that don’t. The projects that fail are not failing because the tools aren’t ready — the tools are remarkable, genuinely some of the most exciting technology I have worked with in twenty years. They fail because someone decided that if the AI could do “almost everything,” the humans could do almost nothing. That math has never worked in any other engineering discipline, and it does not work here either.
Done right, with this process or something like it, generative AI lets a small, thoughtful team ship in weeks what used to take quarters. The code is cleaner. The architecture is more deliberate. The humans on the team spend their time on the parts that genuinely required them — which are, not coincidentally, the parts they enjoy most. This is the most exciting shift in software development in my career, and I am not prone to overstatement.
But it requires you to do the work. Not the work of typing code — we’ve offloaded that — but the work of thinking clearly, specifying precisely, decomposing wisely, and reviewing honestly. The teams that master these skills are going to build remarkable things this decade. The teams that don’t will spend it debugging mystery code nobody remembers writing.
If your current project is in the second bucket, it is not too late. But you are going to need a different process than the one that got you there.
And probably someone to show you where to start.