Spec-Driven Development: Keeping Humans in the Loop While Scaling AI-Assisted Code Review

A field report from building Real Polite Protocol, an open protocol and reference implementation on Deno Deploy.


The problem with “vibe coding” at scale

AI coding agents have gotten genuinely good. They will happily implement a feature end-to-end, write the tests, run them, and report green. The trouble is that “the tests pass” is a much weaker signal than it used to be — because the agent that wrote the code also wrote the tests, and both were optimizing to satisfy a prompt you typed two minutes ago.

That is fine for a throwaway script. It is a disaster for a protocol implementation, an API surface, or anything where what the system is supposed to do matters more than whether today’s code happens to do something.

The reviewer’s job is being squeezed from two sides at once. Agents can produce more code per unit time than any reviewer can meaningfully read. And the prompts driving them are ephemeral, so by the time you notice a behavior drifted, the conversation that justified it is gone.

The technique I want to describe is the one I’ve been using on Real Polite Protocol (RPP). It is not novel — formal methods folks have been doing this for decades — but it adapts cleanly to a workflow where most of the typing is done by an AI agent and most of the deciding still needs to be done by a human.

The authority chain

The core idea is a strict, written authority order that every change in the repo must respect:

  1. RFC / spec: rpp-spec.md
  2. Requirement documents: .github/requirements/**
  3. Requirement tests: src/requirements/**
  4. Scenarios: .github/scenarios/** + src/scenarios/**
  5. Implementation code: src/**

A lower-authority artifact MUST NOT contradict a higher-authority one. When a test fails, the implementation gets fixed — the test is not weakened. When the desired behavior conflicts with a requirement, the requirement (and possibly the spec) is changed deliberately and explicitly, then the tests and code follow.

Tying specs directly to tests does something subtler than enforce discipline — it gives the agent durable context across features. The requirement files are not scratch notes thrown away at the end of a conversation; they accumulate as a structured record of every behavior the system has ever committed to. When a user asks for a change that is contradictory, partially overlapping, or just complex enough to interact with three other features, the agent loads the relevant requirement files and reasons against the full history of decided behavior, not against whatever the current code happens to look like. The default failure mode of an unscoped agent is to silently regress old requirements while satisfying the new prompt — because the old requirements existed only in a prior conversation it cannot see. With the requirements pinned in the repo and pinned to tests, the agent has to confront the conflict explicitly: either the new request is consistent with the existing requirements, or it requires changing one of them, and that change is itself a reviewable artifact.

Requirement documents are the unit of intent

In RPP, every behavior (“listeners can accept a pending invitation”, “rotating a verification key archives the previous active key”, “the server refuses messages to a blocked contact”) is a small Markdown file under .github/requirements, grouped by feature area. Each one carries frontmatter that ties it back to the spec:

---
id: invitations-003
title: Listeners can accept a pending invitation
spec_ref: "10.2, 10.4, 11.2, 12.2"
---

The body describes the expected behavior in plain language: what the tool accepts, what it returns, what error codes it emits, what state transitions it performs. It is detailed enough that two different implementers — human or agent — would produce functionally equivalent code from it.
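Because the frontmatter is so regular, tooling can read it without a full YAML parser. The following is a minimal sketch under that assumption, not the code RPP actually uses: the RequirementMeta type and the parseRequirementMeta function are hypothetical names, and the reader only understands flat key: value lines like the ones shown above.

```typescript
// Hypothetical shape for a parsed requirement header. The fields mirror the
// frontmatter keys shown above; spec_ref is split into individual sections.
interface RequirementMeta {
  id: string;
  title: string;
  specRefs: string[];
}

// Parse the leading `---` block of a requirement document.
// This is a deliberately tiny key/value reader, not a full YAML parser.
function parseRequirementMeta(markdown: string): RequirementMeta {
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(markdown);
  if (!match) throw new Error("missing frontmatter block");

  const fields = new Map<string, string>();
  for (const line of match[1].split(/\r?\n/)) {
    const colon = line.indexOf(":");
    if (colon === -1) continue; // skip lines that are not key: value pairs
    const key = line.slice(0, colon).trim();
    // Strip optional surrounding double quotes from the value.
    const value = line.slice(colon + 1).trim().replace(/^"(.*)"$/, "$1");
    fields.set(key, value);
  }

  const id = fields.get("id");
  const title = fields.get("title");
  if (!id || !title) throw new Error("frontmatter needs id and title");

  // "10.2, 10.4" becomes ["10.2", "10.4"]; a missing spec_ref becomes [].
  const specRefs = (fields.get("spec_ref") ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  return { id, title, specRefs };
}
```

Splitting spec_ref into individual section numbers is what would let a report answer “which spec sections have at least one requirement document?” without reading the body of any file.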

Critically, the requirement file is the smallest reviewable unit of intent in the repo. When an agent proposes a change, the most important diff in the PR is usually the requirement diff. If that diff is empty and the code diff is large, something is wrong — either the agent is doing unrequested work, or there is a behavior change that should have been promoted to a requirement first.
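The “empty requirement diff, large code diff” signal can be checked mechanically. Here is a minimal sketch, assuming the directory layout from the authority chain; the ChangedFile shape, the checkIntentDrift name, and the 200-line threshold are all hypothetical choices made for illustration.

```typescript
// One changed file in a pull request (hypothetical shape for this sketch).
interface ChangedFile {
  path: string;
  linesChanged: number;
}

// Requirement documents live under .github/requirements per the authority chain.
function isRequirementDoc(path: string): boolean {
  return path.startsWith(".github/requirements/") &&
    path.endsWith(".requirement.md");
}

// Implementation code is anything under src/ that is not a requirement test
// or a scenario test.
function isImplementation(path: string): boolean {
  return path.startsWith("src/") &&
    !path.startsWith("src/requirements/") &&
    !path.startsWith("src/scenarios/");
}

// Returns a warning when a large code diff arrives with no requirement diff,
// or null when nothing looks suspicious. The 200-line default is an arbitrary
// illustrative threshold, not a value taken from RPP.
function checkIntentDrift(
  files: ChangedFile[],
  maxUnanchoredLines = 200,
): string | null {
  const requirementTouched = files.some((f) => isRequirementDoc(f.path));
  const codeLines = files
    .filter((f) => isImplementation(f.path))
    .reduce((sum, f) => sum + f.linesChanged, 0);

  if (!requirementTouched && codeLines > maxUnanchoredLines) {
    return `${codeLines} implementation lines changed with no requirement diff`;
  }
  return null;
}
```

A check like this cannot tell whether a change is a pure refactor, so it is better treated as a prompt for the reviewer to ask “why is there no requirement diff?” than as a hard gate.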

Requirement tests are the mechanical pin

Every requirement document has a mirrored test file under src/requirements at the exact same relative path:

Requirement document                    Test file
startup.requirement.md                  startup.requirement.test.ts
003-accept-invitation.requirement.md    003-accept-invitation.requirement.test.ts

Each test file has a single top-level Deno.test() named after the requirement id:

Deno.test("req:invitations-003 - Listeners can accept a pending invitation", async (t) => {
  await t.step("creates a bilateral contact", async () => { /* ... */ });
  await t.step("dispatches an invitation_reply envelope", async () => { /* ... */ });
  await t.step("rejects non-pending invitations with E_INVITATION_NOT_PENDING", async () => { /* ... */ });
});

A few details that turn out to matter a lot:

  • Mirrored paths are mandatory, which makes structural coverage analysis trivial — a script can walk both trees and report any requirement without a test, or any test without a requirement, without understanding the contents of either.
  • Tests are integration-scoped. They exercise the real controllers, managers, and repositories against an isolated Deno KV store. They verify what the requirement says, not what the implementation happens to do today.
  • No inline helpers. Reusable test helpers live in helpers, one per file. Test files contain only imports and Deno.test() calls. This sounds like nitpicking but it is load-bearing — it prevents agents from quietly introducing test-local “helpers” that paper over behavior the requirement does not actually mandate.
  • Tautological tests are explicitly banned. Asserting typeof fn === "function" is not a test. Asserting assertEquals(true, true) is not a test. If a test fails, the rule is to diagnose and fix the root cause — never to weaken the assertion until it passes.

The last point is where most of the human-in-the-loop value lives. The instruction file says it directly:

Never reduce a test’s scope or remove assertions to work around a failure. The requirement document is the source of truth — the test must faithfully verify what the requirement states.

Agents will absolutely try to weaken tests when they get stuck. Codifying “no” in a rule the agent reads on every turn is what makes the system robust.
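The mirrored-path rule is what makes the structural check cheap. As a rough sketch of the idea (not RPP’s actual gap-analysis script), the function below takes two lists of file paths and reports anything unpaired. The findUnpaired and PairingReport names are hypothetical; the roots and suffixes are taken from the authority chain and the file names shown earlier.

```typescript
interface PairingReport {
  requirementsWithoutTests: string[];
  testsWithoutRequirements: string[];
}

const DOC_ROOT = ".github/requirements/";
const TEST_ROOT = "src/requirements/";
const DOC_SUFFIX = ".requirement.md";
const TEST_SUFFIX = ".requirement.test.ts";

// Reduce a path to its shared "stem", e.g.
// ".github/requirements/startup.requirement.md" -> "startup".
// Returns null for paths outside the expected root or with another suffix.
function stem(path: string, root: string, suffix: string): string | null {
  if (!path.startsWith(root) || !path.endsWith(suffix)) return null;
  return path.slice(root.length, path.length - suffix.length);
}

function collectStems(paths: string[], root: string, suffix: string): Set<string> {
  const stems = new Set<string>();
  for (const path of paths) {
    const s = stem(path, root, suffix);
    if (s !== null) stems.add(s);
  }
  return stems;
}

// Compare the two trees purely by path; file contents are never read.
function findUnpaired(docPaths: string[], testPaths: string[]): PairingReport {
  const docStems = collectStems(docPaths, DOC_ROOT, DOC_SUFFIX);
  const testStems = collectStems(testPaths, TEST_ROOT, TEST_SUFFIX);
  return {
    requirementsWithoutTests: Array.from(docStems)
      .filter((s) => !testStems.has(s))
      .sort(),
    testsWithoutRequirements: Array.from(testStems)
      .filter((s) => !docStems.has(s))
      .sort(),
  };
}
```

The path lists would come from walking the two directories. That step is left out so the comparison stays a pure function that is easy to test.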

Scenarios for whole-system journeys

Requirement tests verify a single normative behavior in-process. They are not enough on their own, because protocols are emergent — the interesting bugs live in the seams between two users, two domains, and two HTTP round-trips.

So RPP has a second tier of executable spec: scenarios. Each is a Markdown document under .github/scenarios describing a real-world journey in plain English (“Alice opens a receptive window; Justin sends her an invitation; Alice accepts and replies with a message”), paired with a mirrored src/scenarios/**/<name>.scenario.test.ts that drives the journey through the real MCP HTTP endpoint using authenticated persona clients.

Scenarios sit below requirement tests in the authority chain. They complement requirement tests; they do not replace them. If a scenario fails because of a conflict with a higher-authority artifact, the implementation gets fixed — the scenario is not weakened.

Coverage as a continuous artifact

Because the structure is mechanically uniform, coverage can be reported as a continuous, regenerable artifact rather than a manual review pass. Two reports live under reports and are regenerated as the codebase evolves:

  • gap-analysis-report.md — structural coverage. Which RFC sections have requirement docs? Which requirement docs have tests? What is missing?
  • evaluate-report.md — semantic coverage. How thoroughly does each requirement test actually verify the meaning of its requirement document, with concrete suggestions for closing gaps?

Coverage is reported as X/Y requirements covered by tests (Z%) and significant gaps — missing requirements, weak normative language, untested behaviors — are called out explicitly so they can be addressed before they become regressions. The reports overwrite themselves on every run; git history is the audit trail.
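The summary line itself is simple arithmetic. A small sketch of how it could be produced follows; the coverageSummary name is hypothetical, and reporting an empty suite as 0% is an assumption of this sketch rather than something the reports are known to do.

```typescript
// Format the headline coverage figure: "X/Y requirements covered by tests (Z%)".
function coverageSummary(covered: number, total: number): string {
  if (
    !Number.isInteger(covered) || !Number.isInteger(total) ||
    covered < 0 || covered > total
  ) {
    throw new RangeError("covered must be an integer between 0 and total");
  }
  // Assumption: an empty suite is reported as 0% so it never looks complete.
  const percent = total === 0 ? 0 : Math.round((covered / total) * 100);
  return `${covered}/${total} requirements covered by tests (${percent}%)`;
}
```

Keeping the wording of this line stable matters more than the arithmetic: a one-line change in the report diff is what lets a reviewer spot a coverage regression at a glance.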

This is the part that changes the reviewer’s job. Instead of reading every line of a 2,000-line PR, a reviewer can look at:

  1. The requirement diff (did intent change?).
  2. The gap-analysis diff (did coverage regress?).
  3. The evaluate-report diff (did any requirement get less thoroughly verified?).

…and then spot-check the code only where one of those three signals fires.

The workflow

When adding a new behavior, the order is always:

  1. If the spec does not already mandate the behavior, propose a spec change first.
  2. Add or update a requirement document under .github/requirements with a stable id and a spec_ref back to the relevant spec section(s).
  3. Add or update a mirrored requirement test under src/requirements that fails until the behavior is implemented.
  4. Implement the behavior in src until the test passes.
  5. Where the behavior spans multiple personas or HTTP round-trips, add a scenario.
  6. Re-run the gap-analysis and evaluate reports and address any regressions.

An agent can do all six steps. A human only needs to deeply read step 2 — and lightly review steps 3 and 5 — to know that the rest of the work is anchored to something they approved.

Where spec-driven flexes: repo-level and implementation-level instructions

The thing I want to push back on, before someone reads the above and concludes “this is just waterfall in a trench coat,” is that the spec only governs what the system does, not how. The repo still needs opinions about how to write the code, and those opinions live in a separate, complementary layer of customization files. They are loaded by the agent on every relevant turn, and they apply alongside the spec, not in tension with it.

In RPP, that layer looks like this:

Instructions

instructions holds rules that apply automatically based on file path globs:

  • general.instructions.md (applyTo: **) — repo-wide rules: Deno-first, JSR over npm, all third-party imports must be mapped in deno.json, Date objects are the canonical internal representation everywhere (strings only exist on serialized boundaries), Zod validates every external boundary, and so on.
  • patterns.instructions.md (applyTo: src/**) — Grove API conventions: how the context initialization chain layers services → repositories → managers, how controllers accept Request and return Response, where MCP tools register.
  • requirement-testing.instructions.md (applyTo: **/*.requirement.*) — the rules described above: mirrored paths, single top-level Deno.test(), no inline helpers, no tautological tests, never weaken a failing test.
  • scenario-testing.instructions.md (applyTo: **/*.scenario.{md,test.ts}) — scenario conventions: persona clients only, HTTP only, fresh server per test.

These instructions are orthogonal to the spec. The spec says “the system must accept invitations.” The instructions say “and when you write the code, use Zod at the boundary, keep dates as Date objects internally, register the tool through Grove’s MCP layer, and follow the mirrored test path convention.” Neither is sufficient without the other.
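To make the “Date objects internally, strings only at serialized boundaries” rule concrete, here is a minimal sketch. It uses a hand-written validator as a dependency-free stand-in for Zod, so it shows the shape of the rule rather than RPP’s actual schemas. The Invitation types, their field names, and both function names are hypothetical.

```typescript
// Hypothetical wire shape: dates travel as ISO 8601 strings.
interface InvitationWire {
  id: string;
  expiresAt: string;
}

// Hypothetical internal shape: dates are always Date objects.
interface Invitation {
  id: string;
  expiresAt: Date;
}

// Inbound boundary: validate untrusted input and convert strings to Dates.
// In the real codebase this role is played by a Zod schema.
function parseInvitation(raw: unknown): Invitation {
  if (typeof raw !== "object" || raw === null) {
    throw new TypeError("expected an object");
  }
  const { id, expiresAt } = raw as Record<string, unknown>;
  if (typeof id !== "string" || id.length === 0) {
    throw new TypeError("id must be a non-empty string");
  }
  if (typeof expiresAt !== "string") {
    throw new TypeError("expiresAt must be an ISO 8601 string");
  }
  const parsed = new Date(expiresAt);
  if (Number.isNaN(parsed.getTime())) {
    throw new TypeError("expiresAt is not a valid date");
  }
  return { id, expiresAt: parsed };
}

// Outbound boundary: the only place a Date is turned back into a string.
function serializeInvitation(invitation: Invitation): InvitationWire {
  return {
    id: invitation.id,
    expiresAt: invitation.expiresAt.toISOString(),
  };
}
```

Everything between those two functions can compare and sort dates without wondering whether it is holding a string or a Date, which is the point of the rule.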

Skills

skills holds reusable, named procedures the agent can invoke:

  • deno-add-module — the right way to add a dependency: map it in deno.json first, prefer JSR, then import from the bare specifier.
  • gap-analysis — regenerate the structural coverage report.
  • evaluate — regenerate the semantic coverage report.
  • get-rpp-token — acquire a bearer token for client scripts (prefers RPP_TOKEN, falls back to az account get-access-token).

Skills are how you capture “the validated way we do this thing” without inlining it into the spec or the instructions. When an agent needs to add a module, it follows deno-add-module rather than improvising — and when the procedure changes, you update one file and every future agent run picks it up.

Agents and prompts

agents defines specialized subagents (deno-deploy-builder, deno-deploy-reviewer) with narrower personas and tool sets than the default. prompts holds named, reusable prompts (e.g. quality-checks.prompt.md) that bundle a standard check into a single invocation.

These exist so that implementation-specific concerns — “this is a Deno Deploy edge runtime, no Node-only APIs, no local disk state, fast startup” — can be enforced without polluting the spec, which should remain implementation-agnostic. The RPP spec does not say “the reference implementation runs on Deno Deploy.” That fact lives in the customization layer, exactly where it belongs.

Why this scales the reviewer

The combination is what actually makes a human reviewer effective against an agent producing PRs faster than they can read code:

  • The spec is small, slow-changing, and human-authored. Reviewing a spec diff is a focused intellectual task, not a marathon.
  • The requirements are mechanically pinned to the spec by spec_ref and mechanically pinned to tests by mirrored paths. A reviewer can check coverage without reading test bodies.
  • The tests cannot be tautological, cannot share inline helpers, and cannot be weakened to pass. The agent reads those rules on every turn.
  • The scenarios verify the seams between requirements that no single requirement test would catch.
  • The coverage reports make regressions visible at the PR level instead of at the next outage.
  • The instructions, skills, agents, and prompts keep the how — runtime, framework, dependency policy, dev workflow — out of the spec and in a layer that can evolve independently.

The agent does the typing. The human does the deciding. The artifacts mechanically force the typing to match the deciding.

That, in one sentence, is the whole technique.

Try it

The full repo is at github.com/justinmchase/real-polite-protocol. The interesting files to read first, in order, are:

  1. rpp-spec.md: the protocol itself.
  2. .github/requirements: the decomposition of the spec into reviewable units of intent.
  3. src/requirements: the executable pins.
  4. instructions: the layer that adapts the technique to this particular codebase.
  5. reports: what the continuous coverage artifact actually looks like.

You do not need to adopt all of it at once. The single highest-leverage change — the one that pays for itself the first week — is the authority chain itself. Write it down. Make it the first thing every agent reads. Then start moving behaviors into requirement files one PR at a time.

The agents get faster. The reviewer stays in control. Both need to be true at the same time, and spec-driven development is the cheapest way I have found to make them so.

Bing AI can write code in my custom programming language and it’s blowing my mind

I’ve been working on a programming language project for quite a while. It’s been challenging and fun as my on-again-off-again side project for probably 10+ years.

Needless to say this programming language isn’t known by many people and it is fairly original while borrowing from a few other similarly obscure language projects.

The language is a parser generator language, a programming language programming language if you will. Its intent is to make it easier for people to create DSLs and the accompanying tools. You can check it out at https://github.com/justinmchase/uffda

So I figured it would be an interesting opportunity to see how good the new Bing AI actually is by asking it to do something that is cognitively difficult for a human and that has little to no literature available online for it to have been trained on.

So I asked it to “write a program in the Uffda language that parses Morse Code?”

It gave me a little guff, saying that Uffda is actually a Scandinavian expression, which is closer to what I expected it to do. But then it went on to say that it’s actually a TypeScript library for a parser generator and offered “If you want, I can write a program in TypeScript that uses the Uffda library to define a parser for Morse Code.😊”

It then gave me the option to press a button to “Yes, please write the program.”

And so I did… And the results kind of astonished me. It actually wrote what appears to be a viable parser, and beyond that it took advantage of a few concepts which really surprised me:

  • The TypeScript code was correct (unsurprisingly)
  • The Uffda code inside the string inside TypeScript is correct (a language inside a language!)
  • It’s using Direct Left Recursion, which is a pretty complex concept
  • It added completely accurate and relevant code comments
  • It actually included an example of running the code, as well as the expected output in a code comment, which implies that it simulated running the program. Even if it isn’t actually capable of doing that, giving the impression that it did is still pretty amazing.

The Code

// Import the Uffda library
import { dsl, uffda } from "https://deno.land/x/uffda/mod.ts";

// Define a parser for Morse Code using Uffda syntax
export const Morse = await uffda()`
  // A dot is represented by a single dot character
  Dot = ({ kind = 'Token', value = '.' } -> '.');
  
  // A dash is represented by a single dash character
  Dash = ({ kind = 'Token', value = '-' } -> '-');
  
  // A letter is composed of one or more dots and dashes
  Letter 
    = (l:Letter Dot r:Letter -> l + '.' + r) // Concatenate dots and dashes
    | (l:Letter Dash r:Letter -> l + '-' + r)
    | Dot // A single dot is also a letter
    | Dash // A single dash is also a letter
    ;
  
  // A word is composed of one or more letters separated by spaces
  Word 
    = (l:Word { kind = 'Token', value = ' ' } r:Letter -> l + ' ' + r) // Concatenate letters with spaces
    | Letter // A single letter is also a word
    ;
  
  // A sentence is composed of one or more words separated by slashes
  Sentence 
    = (l:Sentence { kind = 'Token', value = '/' } r:Word -> l + '/' + r) // Concatenate words with slashes
    | Word // A single word is also a sentence
    ;

  Main 
    > Sentence; // The main parser expects a sentence as input

`;

// Create a DSL function that parses Morse Code using the defined parser
export const morse = dsl(import.meta.url, Morse);

// Parse some Morse Code and print the result
const { value } = morse`.... . .-.. .-.. --- / .-- --- .-. .-.. -..`;
console.log(value); // Prints ".... . .-.. .-.. --- / .-- --- .-. .-.. -."

Now as I’m expanding this I can see it actually got the code a little bit wrong. The last line of the parser is the Main rule which should actually be…

Main = ${Basic} > Sentence;

So I told it that it made a mistake and forgot to import the Basic pattern and pipe it into the Sentence pattern. I asked it to correct and reprint the program…

It emitted the same program again but this time it actually included the missing Basic pattern import.

However it did completely omit the Main rule as well. So I gave it another hint and asked it to try again.

And so this time it printed it out but still not fully correct (> instead of =).

Also, somewhat astonishing here is that the code comment is saying “The main parser expects a sentence as input after applying the Basic pattern”, which is correct despite me calling it a “rule” in the previous comment (A rule is a kind of a pattern in this language).

Moving on, I asked it to print out each letter as ASCII instead of printing out the input dots and dashes. Here it made a pretty big mistake by adding a switch statement, which isn’t even valid syntax in Uffda.

So I then gave it a hint about pattern matching and the correct syntax and it had pretty amazing results.

It sort of broke down around the letter O and stopped printing spaces between the patterns, and the Z pattern completely stopped using the Dot and Dash patterns and just printed out literal dots and dashes. But honestly I’m still pretty impressed with the results.

Asking it to fix the O through Z patterns by adding spaces seems to not be working. Asking it to use Dot and Dash pattern references instead of verbatim . and - characters seems to have exacerbated the problem. So at this point I think we hit the edge of its capabilities.

That being said I’m highly impressed. For it to be able to write a language within a language is pretty surprising, especially for a language that it couldn’t possibly have had much training data on.

Being able to talk to it naturally and have it build context is a really nice way to work with a search engine. Being able to simply correct it and have it seem to grow its understanding of the programming language as we went felt really natural. Honestly, I have given many code reviews that felt pretty similar to this conversation with the AI, and some of those went less well.

I do feel like I hit a limit to its capabilities, but it’s very impressive nonetheless. I hope someday I will be able to essentially give the AI more and more constraints in a natural way and have it fully write the code needed to make it work.

Hybrid Microservice Architecture

In the article What are Microservices, Amazon Web Services does a great job of defining Microservices.

Microservices are an architectural and organizational approach to software development where software is composed of small independent services that communicate over well-defined APIs. These services are owned by small, self-contained teams.

Additionally, they do an excellent job of outlining the many strengths of Microservices as contrasted to Monolithic applications.

With a microservices architecture, an application is built as independent components that run each application process as a service. These services communicate via a well-defined interface using lightweight APIs.

They make many excellent points and do a good job of highlighting true strengths of microservice architectures.

However, what AWS doesn’t mention in this article are the downsides. If you were to read only this article you would be led to believe that there are no downsides and that all is rainbows and unicorns in the land of microservices, and you’d be a fool to do anything else.

Yet when you actually attempt to develop a system as microservices, there are some clear downsides which are hard to ignore, and it’s common to find yourself wishing someone had explained these downsides along with the good.

So here they are. I will outline a few of the biggest problems with microservices, and then I’d like to propose a way to help reduce the effects of these downsides with what I call a Hybrid Microservices Architecture.

Notable Microservice Downsides

  1. Developer context switching
  2. Maintenance costs
  3. Communication costs
  4. Duplication

Developer Context Switching

Developers frequently need to switch between areas of code in the same application. The feature they are working on may span multiple, or all, microservices in an application. They may have a bug which is blocking them but appears in a different team’s microservice. They may have tight coupling between several microservices and need them running in tandem in their development environments.

In all of these cases a developer may need to clone, understand, build and run the code for multiple microservices simultaneously. That set of microservices tends to grow over time as well. Most of the time this is a pretty reasonable thing to do but it’s hard to deny that for each microservice you need to work with in this way more and more cognitive complexity is loaded onto the developer. Eventually there is a tipping point where it can feel “too hard” and the time it takes to get the code to a point where you can work on it feels very slow.

Maintenance Costs

Each microservice has a cost, both in actual server and network costs and in terms of code maintenance. For example, you may have multiple microservices all with a reference to the same library which needs to be updated. That work has to be duplicated across multiple repos and multiple teams.

Each microservice may have its own deployment pipeline, or its own testing infrastructure, its own linting rules, its own repo configuration, its own secrets to rotate.

Even if you utilize shared libraries or tools, when you make a feature improvement for the tool you have to roll out the new version to every independent repository. The patterns and the code are duplicated across multiple repositories, increasing the effort needed to make improvements.

Communication Costs

Because microservices are independent processes, they need to communicate with each other through their respective APIs. This communication requires careful specification of inputs and outputs as well as versioning and backward compatible support for other services which may lag behind on their own updates.

In a monolithic application this communication may simply be between two functions in the same code, and a simple refactor is all that’s needed. But microservices cannot be monolithically refactored; they need to carefully coordinate their breaking changes and use additive techniques and deprecation schedules.

Additionally, communicating over a network effectively may require special communication patterns such as Queues or Streams to manage throughput and reliability. And further, special considerations may need to be put into place to prevent circular messages. This has a monetary and cognitive complexity cost.

Duplication

Microservices are typically isolated; they have their own database, and services such as Queues, Streams and File Storage will likely be considered an internal implementation detail of the service. That infrastructure will need to be duplicated for each service.

Many times, a service will rely on the data owned by another API, but because those services are isolated, they must not access each other’s internal infrastructure directly. So, they go through the API of the other service to fetch or stream data and duplicate it into their own system. Both the storage and the code needed to do this duplication of data has a non-trivial cost.

Hybrid Microservices

Despite all of these downsides the decision tree to choose a microservice architecture vs. a monolithic architecture is still usually pretty clear; yes, you should do microservices, it is worth the cost.

However, I would like to share an additional approach which I am calling a Hybrid Microservice Architecture which I believe can help reduce the costs of developing microservices.

The primary idea of a Hybrid Microservice is to strike a balance between the costs of a microservice and the strengths of a monolith. The strength of the monolith, despite all its problems, is that the code is consolidated together in a single repository.

The second idea of the hybrid architecture is to define the boundary of a microservice not to be based on features or data but rather teams. Each team will consolidate all of the code of their microservices into a single repository.

Yet even though we’re consolidating our code into single repositories per team, I am not suggesting we should simply have multiple smaller monoliths.

Rather, the hybrid approach should adopt a “Mode” pattern which will allow this single repository to run in multiple modes. This will allow us to structure the code into multiple runtime components, which can continue to leverage the strengths of microservices and scale independently based on features and usage.

Mode Pattern

The mode pattern is simple: rather than having a single entry point, the service declares all possible modes that it can run as, and command line arguments are used to cause the process to run in a single mode. Each process can only run in a single mode.

Example modes:

  • Web Server
  • Cron Job
  • Function or Lambda
  • Stream Handler

The advantage of this is that the business logic of the application (the models, the repositories, the utilities, etc.) is shared in the same code base for all modes of the application. You don’t need an extra repository to hold the shared code, which would increase the maintenance and cognitive costs by requiring you to publish a library and reference it across multiple other repositories.
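A minimal sketch of the mode pattern follows. It is an illustration under stated assumptions, not how Grove implements it: the --mode flag, the mode names, and the handler bodies are all hypothetical, and each handler just returns a description where a real service would start a server, a scheduler, or a consumer.

```typescript
// Every mode this (hypothetical) service can run as.
const MODES = ["web", "cron", "function", "stream"] as const;
type Mode = (typeof MODES)[number];

// Shared business logic: every mode calls into the same code.
function describeWork(mode: Mode): string {
  return `running shared business logic in ${mode} mode`;
}

// One entry per mode. In a real service these would start an HTTP server,
// a scheduled job, a function handler, or a stream consumer.
const handlers: Record<Mode, () => string> = {
  web: () => describeWork("web"),
  cron: () => describeWork("cron"),
  function: () => describeWork("function"),
  stream: () => describeWork("stream"),
};

// Read the mode from the command line. Accepts `--mode web` or `--mode=web`.
function parseMode(args: string[]): Mode {
  let value: string | undefined;
  args.forEach((arg, i) => {
    if (arg === "--mode") value = args[i + 1];
    else if (arg.startsWith("--mode=")) value = arg.slice("--mode=".length);
  });
  const mode = MODES.find((m) => m === value);
  if (!mode) throw new Error(`unknown or missing mode: ${value ?? "(none)"}`);
  return mode;
}

// A process runs exactly one mode for its whole lifetime.
function run(args: string[]): string {
  return handlers[parseMode(args)]();
}
```

The same build artifact can then be deployed several times with different arguments, for example once with --mode web and once with --mode cron, and each deployment can be scaled on its own.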

Additionally, since the code is identical between all processes running in the same microservice (just in different modes) you can safely communicate with shared data storage, such as the database. Normally there is a tension between two applications calling the same database directly because their code may be different and altering the schema of data that another application depends on can cause issues with the other applications. When this happens your database becomes an API in and of itself and can become extremely difficult to change safely.

Additionally, facilitating network-based communication between two microservices of the same team can be expensive, feel unnecessary and be slow. Requiring a Cron Job to call into web APIs simply to get the data it needs to do some work can be prohibitive, especially when it needs to crunch large amounts of data.

Therefore, the hybrid solution posits this hypothesis:

It’s safe for multiple microservices to access the same database directly provided they have identical code at all times.

And here is the summary in bullet point format:

  • Organize into as few teams as possible
  • Each team has a single repository for all of their code
  • Each team repository implements all required Modes
  • Each microservice owned by the same team runs the same code
  • Each microservice runs a single mode
  • Each microservice running the same code is safe to access the same internal services

Grove

I have made a public, Open Source Hybrid Microservice framework called Grove for use in code or as a proof of concept.

I contributed to Deno

I’ve been developing with Node.js for a while now and I have been enjoying it quite a bit, despite some of its flaws. Overall it’s been a great experience and I am a big Node.js fan.

Much to my surprise, however, a new project called Deno has emerged. After taking a look at it I realized it is essentially a spiritual successor to Node.js, but it’s also better than Node in pretty much every single way… Well, every way except for its ubiquity! There are tons of high quality open source modules for Node which just don’t quite work with Deno out of the box. Deno took a hard line stance on its adoption of ESM modules, which are actually better than CommonJS, and which enable a variety of other features such as not needing npm at all anymore… It’s just that ESM is not very widespread and is only backwards compatible with CommonJS modules with some hacks that only seem to work about 75% of the time.

Deno also has a few areas which are still under development, related to certs, TLS and websockets. But fortunately the project has a very active and responsive team of developers! I noticed an issue I was having related to connecting to an internal site due to my CA certificates not being loaded and took the time to debug it. Eventually I figured out that the proper CA cert was stored in my system’s keystore and Deno couldn’t find it there. So I managed to find a simple Rust crate which supported loading certificates right out of the keystores for each major platform and figured out it was pretty trivial to integrate it with the crates Deno was already using to do TLS! The Deno developers worked with me to craft the proper changes and do some necessary refactoring and testing, and now I am a Deno contributor.

Here’s my commit:

https://github.com/denoland/deno/commit/02c74fb70970fcadb7d1e6dab857eeb2cea20e09

More

https://github.com/denoland/deno_std/commit/396445052d25b206e0adb00826c7365783fa578a