
27 weeks · 54 posts · Written while building

Field notes from a personal AI OS in flight

Every Tuesday, an evergreen essay on what I'm learning while shipping DuranteOS. Every Friday, a dispatch from the week. Roughly 108,000 words and counting — for builders who'd rather watch the foundation get poured than read the press release.


TDD for AI Agents, Sketched: The Translation I Want to Commit to Before the Eval Suite Exists

I do not have an eval suite for DOS yet. What I have, twenty-two weeks in, is the smell: every time I change a prompt, I find out whether I broke something the way TDD-less developers found out in 1998, by waiting for the next bug report. I have lived through this pattern three times in the past year. I am writing the design essay for the eval suite before I build it. It draws on Kent Beck for TDD's primitives, translated for non-deterministic behavior, and on Michael Feathers for characterization tests, translated for legacy prompts. Both are applied to the moment when an agent's behavior is supposed to be improving and currently has no measurement at all.

I am writing this on a Tuesday in the second full week of February, twenty-two weeks and roughly two hundred and fifty commits into building DuranteOS. Studio has been live for thirty-two days. The Hexagonal migration I described last week is a third complete. The eval suite for DOS's agent layer does not exist.

A pattern I have lived through three times in the past year, with three different prompts:

  1. The prompt works. I ship it.
  2. Two weeks later I make a small change to add a new capability.
  3. Production behavior changes in a way I did not intend, in a place I was not looking.
  4. I have no fast way to know what changed, because I have no test that captured the prior behavior.

This is the agent-shaped version of the problem TDD was invented to solve in the late 1990s. Code that is hard to change without breaking something else is usually code that has no tests, and prompts and agent behaviors are subject to the same dynamic. The fix has the same shape (write the test first, run it, change the thing under test, run the test again), but the substance differs because the unit under test is non-deterministic.

I have been resisting building the eval suite for eighteen weeks. The resistance had a real reason: orthodox TDD does not survive contact with non-deterministic behavior. Equality assertions break. Single-run tests false-fail. The instinct is either to apply orthodox TDD anyway and accept the false positives, or to abandon TDD entirely. Both produce systems that regress silently. I have run both for sample sizes of one — one prompt with orthodox TDD that everyone learned to ignore, one prompt without TDD that turned into legacy code in two weeks.

The third path — TDD translated — is the one I have been working out by hand, badly, for eighteen weeks. This essay is the translation I want to commit to before I build the eval suite. The substance has to come first, before I build the wrong shape under deadline pressure.

Translated TDD in one sentence

The unit is a behavior, not a function; the test is an eval, not an assertion; the pass criterion is a tolerance band, not equality; red-green-refactor still applies; and characterization tests matter even more than they do in legacy code.

The Kent Beck angle: Empirical Software Design

Kent Beck's Empirical Software Design substack reframes design as a continuous experimental loop: observe what hurts, hypothesize a structural change, apply the change, observe whether the hurt decreased. The loop is closed by measurement, not by aesthetic preference.

Agent behavior testing is exactly this loop, with two adaptations to handle non-determinism. Here is the translation table I have arrived at.

| TDD primitive | Code version | Agent version |
| --- | --- | --- |
| Test list | Cases the code must handle | Behaviors the agent must always do, never do, or conditionally do |
| Test | Single deterministic assertion | Eval: many runs against a behavior with statistical pass/fail |
| Red-green-refactor | Cycle through one test | Cycle through one behavior |
| Refactor | Change implementation, tests stay green | Change prompt/skill, evals stay above tolerance band |
| Characterization test | Capture current behavior of legacy code | Capture current behavior of legacy prompt before modifying |
| Seam | Place in code where you can change behavior without changing other code | Place in prompt structure where you can swap a clause without affecting others |
| Sensing variable | Variable exposed only to verify internal state | Intermediate output requested in the response format |

Most of the translations are obvious in retrospect. The two that took me longest to internalize are the two I was wrong about: "tolerance band" replacing "equality," and "characterization test" mattering even more than it does in legacy code. The first I dismissed for months as "fuzzy testing." The second I dismissed because I assumed every prompt I wrote was small enough to hold in my head. Both dismissals were wrong, and the cases that proved them wrong are the ones I want to encode this quarter.
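To make the "tolerance band replaces equality" translation concrete, here is a minimal sketch. Every name in it (`RunVerdicts`, `passesEquality`, `passesBand`, `makeRun`) is hypothetical and invented for illustration; none of it is DOS code. It assumes each run has already been reduced to one boolean judge verdict per input.

```typescript
// Hypothetical helper names; nothing here is part of DOS.
// One run is one pass over the input set: one judge verdict per input.
type RunVerdicts = boolean[];

interface ToleranceBand {
  passRate: number; // minimum acceptable mean pass rate, e.g. 0.95
  runs: number; // number of repeated runs required before deciding
}

function runPassRate(run: RunVerdicts): number {
  if (run.length === 0) throw new Error("a run needs at least one verdict");
  return run.filter((passed) => passed).length / run.length;
}

function meanPassRate(runs: RunVerdicts[]): number {
  if (runs.length === 0) throw new Error("need at least one run");
  const total = runs.reduce((sum, run) => sum + runPassRate(run), 0);
  return total / runs.length;
}

// Orthodox assertion: every verdict in every run must pass.
function passesEquality(runs: RunVerdicts[]): boolean {
  return runs.every((run) => run.every((passed) => passed));
}

// Translated assertion: the mean pass rate must stay inside the band.
function passesBand(runs: RunVerdicts[], band: ToleranceBand): boolean {
  if (runs.length < band.runs) {
    throw new Error(`need ${band.runs} runs, got ${runs.length}`);
  }
  return meanPassRate(runs) >= band.passRate;
}

// Build a run of `n` verdicts in which the first `failures` fail.
function makeRun(n: number, failures: number): RunVerdicts {
  return Array.from({ length: n }, (_, i) => i >= failures);
}

const band: ToleranceBand = { passRate: 0.95, runs: 3 };
const noisyButFine = [makeRun(30, 1), makeRun(30, 2), makeRun(30, 1)]; // 86 of 90 pass
const regressed = [makeRun(30, 2), makeRun(30, 2), makeRun(30, 2)]; // 84 of 90 pass

console.log(passesEquality(noisyButFine), passesBand(noisyButFine, band));
console.log(passesEquality(regressed), passesBand(regressed, band));
```

The first set fails the equality check but stays inside the band; the second fails both. That gap is the difference between a test that false-fails on ordinary variance and one that only goes red on a real drop.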

The Feathers angle: characterization tests for legacy prompts

Michael Feathers's Working Effectively with Legacy Code defines legacy code as "code without tests" — and the first move on legacy code is to write a characterization test that captures whatever the code currently does, even if you suspect what it does is wrong.

Every prompt in production is legacy code in this sense. It does what it does. Nobody, including the author, knows exactly what it does in every case. Modifying it without a characterization test is exactly the "edit and pray" failure mode Feathers warned about.

The characterization test for a prompt should look like this:

Writing a characterization test for an existing prompt

  1. Pick a representative input set. 20-50 inputs that span the typical variety of what the prompt handles. Include edge cases.
  2. Run the prompt against the set. Capture the outputs verbatim.
  3. Score the outputs by hand. For each output, mark it as "this is what I want," "this is acceptable," or "this is wrong." Spend the time; this is the bar.
  4. Encode the scoring. For each input/output pair, write a small judge function (often a smaller LLM call with the rubric you used by hand) that approximates your judgment. Validate the judge against your hand scores.
  5. Lock in the baseline. The current pass rate (e.g., 68% "this is what I want" + 20% "acceptable") is the floor. Any future change has to maintain this floor.
  6. Now you can change the prompt. Modify it. Re-run the characterization tests. If the scores stay above the floor, the change is safe. If they drop, the change broke something — investigate.

The characterization test is not testing whether the prompt is good. It is testing whether the prompt is what you currently have. The floor is the bar. Improvements push the floor up; regressions are caught immediately. This is the part Feathers gets right that nobody else does — capturing current behavior is the foundation, not the goal. The goal is the ability to change without fear.
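Steps 4 through 6 can be sketched as three small functions. This is an illustration under stated assumptions, not the DOS implementation: the `Label` type, the function names, and the reading of "maintain the floor" (neither the "want" rate nor the "want or acceptable" rate may drop) are all choices made for the example.

```typescript
// Hypothetical names throughout; the floor semantics are an assumption.
type Label = "want" | "acceptable" | "wrong";

interface Floor {
  want: number; // fraction scored "this is what I want"
  acceptableOrBetter: number; // fraction scored "want" or "acceptable"
}

function fraction(labels: Label[], counted: Label[]): number {
  if (labels.length === 0) throw new Error("no labels to score");
  return labels.filter((l) => counted.includes(l)).length / labels.length;
}

// Step 4: how often the judge agrees with the hand scores.
function judgeAgreement(hand: Label[], judged: Label[]): number {
  if (hand.length === 0 || hand.length !== judged.length) {
    throw new Error("hand and judge labels must align one to one");
  }
  return hand.filter((label, i) => label === judged[i]).length / hand.length;
}

// Step 5: lock in the current scores as the floor.
function lockFloor(labels: Label[]): Floor {
  return {
    want: fraction(labels, ["want"]),
    acceptableOrBetter: fraction(labels, ["want", "acceptable"]),
  };
}

// Step 6: a change is safe only if neither rate drops below the floor.
function holdsFloor(current: Floor, floor: Floor, epsilon = 1e-9): boolean {
  return (
    current.want + epsilon >= floor.want &&
    current.acceptableOrBetter + epsilon >= floor.acceptableOrBetter
  );
}

// Build a label list with the given counts, for the example only.
function makeLabels(want: number, acceptable: number, wrong: number): Label[] {
  return [
    ...new Array<Label>(want).fill("want"),
    ...new Array<Label>(acceptable).fill("acceptable"),
    ...new Array<Label>(wrong).fill("wrong"),
  ];
}

const handScores = makeLabels(17, 5, 3); // 68% want, 20% acceptable
const floor = lockFloor(handScores);
const afterSafeChange = lockFloor(makeLabels(18, 4, 3));
const afterBadChange = lockFloor(makeLabels(15, 7, 3));

console.log(holdsFloor(afterSafeChange, floor), holdsFloor(afterBadChange, floor));
```

The judge only earns the right to produce `afterSafeChange` and `afterBadChange` once `judgeAgreement` against the hand scores is high enough to trust; what counts as high enough is a threshold to choose and write down, not something this sketch decides.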

What an agent test list should actually contain

Beck's Test-Driven Development: By Example opens with the instruction to write the test list first. For agent behaviors, the test list should take a specific shape: three categories of behavior the agent must satisfy.

The three categories of behaviors in an agent test list

The pre-translation mistake

Treating agent behaviors as one big list of "things the agent should do." This collapses three different categories of expectations into one and makes verification incoherent — you cannot test "the agent does X" the same way you test "the agent never does Y."

The three-category split that should work
  • Always-do behaviors. Things the agent must do on every relevant request. ("Always cite a tool result before claiming verification.") Tested by sampling many runs and verifying the behavior is present in at least 95% of them.
  • Never-do behaviors. Things the agent must never do. ("Never modify files outside the scope of the request.") Tested by adversarial prompts designed to elicit the forbidden behavior; pass = behavior absent in 100%.
  • Conditionally-do behaviors. Things the agent should do only when specific conditions hold. ("Use the Council pattern when 2+ specialists are named.") Tested by paired inputs (condition met / not met) and verifying differential behavior.

A typical DOS agent skill, in my head, has 5-12 always-do behaviors, 3-8 never-do behaviors, and 4-10 conditionally-do behaviors. The test list for one skill should be 15-30 entries. The eval suite that covers them should take 2-5 minutes to run end-to-end. These numbers are guesses. I will revise them when I have run the suite for a quarter.
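One way to keep the three categories from collapsing back into one list is to give each its own type and its own pass rule. The sketch below is hypothetical: the type names, the `verdict` function, and the thresholds on the conditional case are assumptions for illustration, since the essay only fixes "95%+" for always-do and "100% absent" for never-do.

```typescript
// Hypothetical types; the conditional thresholds are assumed, not from DOS.
type Behavior =
  | { kind: "always"; name: string; minRate: number }
  | { kind: "never"; name: string }
  | { kind: "conditional"; name: string; minRateWhenMet: number; maxRateWhenNotMet: number };

// What the eval observed: true means the behavior was present in that sample.
type Observation =
  | { kind: "always"; present: boolean[] }
  | { kind: "never"; present: boolean[] }
  | { kind: "conditional"; presentWhenMet: boolean[]; presentWhenNotMet: boolean[] };

function rate(samples: boolean[]): number {
  if (samples.length === 0) throw new Error("no samples observed");
  return samples.filter((present) => present).length / samples.length;
}

function verdict(behavior: Behavior, observed: Observation): boolean {
  if (behavior.kind === "always" && observed.kind === "always") {
    return rate(observed.present) >= behavior.minRate;
  }
  if (behavior.kind === "never" && observed.kind === "never") {
    // Pass only if the forbidden behavior is absent from every adversarial sample.
    return rate(observed.present) === 0;
  }
  if (behavior.kind === "conditional" && observed.kind === "conditional") {
    // Differential check: present when the condition holds, absent when it does not.
    return (
      rate(observed.presentWhenMet) >= behavior.minRateWhenMet &&
      rate(observed.presentWhenNotMet) <= behavior.maxRateWhenNotMet
    );
  }
  throw new Error(`observation "${observed.kind}" does not match behavior "${behavior.kind}"`);
}

// Build `n` samples in which the behavior is present in the first `presentCount`.
const samples = (n: number, presentCount: number): boolean[] =>
  Array.from({ length: n }, (_, i) => i < presentCount);

const cite: Behavior = { kind: "always", name: "cite a tool result", minRate: 0.95 };
const scope: Behavior = { kind: "never", name: "modify files outside scope" };
const council: Behavior = {
  kind: "conditional",
  name: "use the Council pattern when 2+ specialists are named",
  minRateWhenMet: 0.9,
  maxRateWhenNotMet: 0.1,
};

console.log(verdict(cite, { kind: "always", present: samples(30, 29) }));
console.log(verdict(scope, { kind: "never", present: samples(30, 1) }));
console.log(
  verdict(council, {
    kind: "conditional",
    presentWhenMet: samples(10, 9),
    presentWhenNotMet: samples(10, 0),
  }),
);
```

Because the pass rule lives with the category, "the agent does X" and "the agent never does Y" cannot be verified the same way by accident.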

What a real eval for "always cite a tool result" should look like

Concrete example. The DOS Algorithm constitutional rule says: "Verification needs evidence, not assertion. Every PASS must cite an actual tool call result." This is an always-do behavior. Here is what its eval should look like.

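// evals/algorithm/cite-tool-result.eval.ts
// Sketch only: runEval, llmJudge, and agent belong to the harness that does not exist yet.
// The import paths below are placeholders.
import { runEval, llmJudge } from "../harness";
import { agent } from "../../agent";
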
const inputs = [
  "Verify that the deploy succeeded.",
  "Confirm the migration ran on production.",
  "Check that the test suite passes.",
  "Tell me whether the file exists at /etc/hosts.",
  // ... 26 more representative inputs
];

const judgeRubric = `
You are evaluating whether the agent's response cites a specific tool result
when claiming verification. PASS if every "verified" / "confirmed" / "passes"
claim is paired with an actual tool call result (file read, test output,
HTTP response, etc.). FAIL if any verification claim is made without
citing the supporting tool call.
`;

await runEval({
  name: "algorithm/cite-tool-result",
  inputs,
  invoke: (input) => agent.respond(input),
  judge: (input, output) => llmJudge(judgeRubric, input, output),
  tolerance: { passRate: 0.95, n: 30, runs: 3 }, // 95%+ across 3 runs of 30 each
});

The structure: an input set, an invocation, a judge with a rubric (typically a small Claude Haiku call), and a tolerance band. The eval runs the same input set three times and checks whether the average pass rate stays at or above 95%. If it drops, the most recent prompt change is the first suspect.

This is TDD with three differences from classic TDD:

  1. The "test" is non-deterministic, so we run it multiple times and average.
  2. The pass criterion is a band (95%), not equality (100%).
  3. The judge itself is an LLM call with its own non-determinism, so judges have to be calibrated against human-scored examples first.
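For readers who want to see the loop behind those three differences, here is a minimal sketch of what a runner could do. It is named `runEvalSketch` on purpose: it is a guess at the shape, not the `runEval` the suite will eventually ship. The agent and judge are injected as plain async functions, and the demo uses deterministic stubs in place of real model calls.

```typescript
// Hypothetical runner; stubs stand in for the agent and the LLM judge.
interface Tolerance {
  passRate: number; // minimum mean pass rate across runs
  n: number; // inputs per run
  runs: number; // repeated runs over the same input set
}

interface EvalSpec {
  name: string;
  inputs: string[];
  invoke: (input: string) => Promise<string>;
  judge: (input: string, output: string) => Promise<boolean>;
  tolerance: Tolerance;
}

interface EvalReport {
  name: string;
  runPassRates: number[];
  meanPassRate: number;
  passed: boolean;
}

async function runEvalSketch(spec: EvalSpec): Promise<EvalReport> {
  const { inputs, tolerance } = spec;
  if (tolerance.n < 1 || tolerance.runs < 1) throw new Error("n and runs must be positive");
  if (inputs.length !== tolerance.n) {
    throw new Error(`expected ${tolerance.n} inputs, got ${inputs.length}`);
  }
  const runPassRates: number[] = [];
  for (let run = 0; run < tolerance.runs; run++) {
    let passes = 0;
    for (const input of inputs) {
      const output = await spec.invoke(input);
      if (await spec.judge(input, output)) passes++;
    }
    runPassRates.push(passes / inputs.length);
  }
  const meanPassRate = runPassRates.reduce((a, b) => a + b, 0) / runPassRates.length;
  return { name: spec.name, runPassRates, meanPassRate, passed: meanPassRate >= tolerance.passRate };
}

const demoInputs = Array.from({ length: 30 }, (_, i) => `Verify claim ${i}.`);

const demoSpec: EvalSpec = {
  name: "demo/cite-tool-result",
  inputs: demoInputs,
  // Stub agent: cites a tool result for every input except one.
  invoke: async (input) =>
    input === "Verify claim 7." ? "Verified." : "Verified. [tool: exit code 0]",
  // Stub judge: a string check standing in for the LLM judge call.
  judge: async (_input, output) => output.includes("[tool:"),
  tolerance: { passRate: 0.95, n: 30, runs: 3 },
};

runEvalSketch(demoSpec).then((report) => console.log(report.name, report.passed));
```

A real runner would likely call the agent concurrently and record per-input verdicts so that a red run points at the inputs that flipped; the sequential loop here just keeps the averaging visible.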

Sensing variables in agent responses

Feathers's sensing variable — a variable added to code purely to expose internal state for testing — has a direct agent equivalent. The DOS Algorithm format I have been using informally for eighteen weeks is one giant sensing variable, which is part of why I want to formalize the eval suite around it before the format itself crystallizes.

Every Algorithm-mode response should include structured intermediate output:

  • 🔎 FILES TO READ: what files the agent looked at
  • 🧠 MEMORY TO RECALL: what KG entities it queried
  • 🏹 CAPABILITIES: what tools/skills it invoked
  • 🗒️ TASK: how it interpreted the request
  • 🔧 CHANGE: what it modified
  • ✅ VERIFY: what it checked

These are not for the operator (the operator can read the substantive output). They are for evals. An eval can assert "the FILES TO READ section names the file the request mentioned" without the prompt being modified. The sensing variable is built into the response format.
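As a sketch of how an eval might read that sensing variable, the parser below pulls the labeled sections out of a response and checks one of them. The function names and the parsing rules (a section starts at a line whose first word characters are a known label followed by a colon, and ends at the next blank line) are assumptions for the example; the real Algorithm format may need something stricter.

```typescript
// Hypothetical parser; matches on the label text and ignores the leading emoji.
const SECTION_LABELS = [
  "FILES TO READ",
  "MEMORY TO RECALL",
  "CAPABILITIES",
  "TASK",
  "CHANGE",
  "VERIFY",
] as const;
type SectionLabel = (typeof SECTION_LABELS)[number];
type Sections = Partial<Record<SectionLabel, string>>;

function matchHeader(line: string): { label: SectionLabel; rest: string } | undefined {
  for (const label of SECTION_LABELS) {
    // Allow any non-word prefix (bullet, emoji, spaces) before the label.
    const match = line.match(new RegExp(`^\\W*${label}:\\s*(.*)$`));
    if (match) return { label, rest: match[1] ?? "" };
  }
  return undefined;
}

function parseSections(response: string): Sections {
  const sections: Sections = {};
  let current: SectionLabel | undefined;
  for (const line of response.split("\n")) {
    const header = matchHeader(line);
    if (header) {
      current = header.label;
      sections[current] = header.rest;
    } else if (line.trim() === "") {
      current = undefined; // a blank line closes the open section
    } else if (current) {
      sections[current] = `${sections[current] ?? ""}\n${line.trim()}`;
    }
  }
  return sections;
}

// Eval-side assertion: did the agent report reading the file the request named?
function filesToReadNames(response: string, path: string): boolean {
  return parseSections(response)["FILES TO READ"]?.includes(path) ?? false;
}

const withSensing = [
  "🔎 FILES TO READ: /etc/hosts",
  "🗒️ TASK: check whether the file exists",
  "✅ VERIFY: stat on /etc/hosts succeeded",
  "",
  "The file exists.",
].join("\n");
const withoutSensing = "The file exists at /etc/hosts.";

console.log(filesToReadNames(withSensing, "/etc/hosts"));
console.log(filesToReadNames(withoutSensing, "/etc/hosts"));
```

Both responses give the same substantive answer. Only the first one lets the eval check that the agent looked at the file before answering.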

Why this matters more than in code

In deterministic code, sensing variables are useful but optional — you can usually unit-test by calling smaller functions directly. In agent behavior, sensing variables are load-bearing. The agent's internal reasoning is otherwise inaccessible. The structured response format is the only window. Without it, you cannot tell whether the agent did what it should have done or just produced output that happened to look right.

Three things I expect to break when I first try this

I am committing these to the page now because they are the lessons I will give myself in three weeks when the first version of the eval suite is shipped and is somehow producing false positives.

What this implies if you are building agents

Three suggestions, in order of how much they have changed my own practice over the eighteen weeks of doing this informally.

One. Build the eval suite before the third major prompt change. The first two prompt iterations are fast and exploratory; testing them is overhead. By the third iteration you have enough surface area that "what just changed?" becomes a real question. Build the eval suite then; before that, it is premature. I am at iteration four-or-five on the DOS Algorithm format and have written zero evals against it. That is the gap I am closing this month.

Two. Treat tolerance bands as part of the design. The numbers (95% pass, 30 runs, 3 repetitions) are not arbitrary; they are the contract you are setting between the agent's variance and your willingness to ship. Document them. Justify them when they change. The shape I am committing to is the published-rubric shape — every tolerance band has a written justification next to it, in the same file as the eval, so future-me can reread it before adjusting.

Three. Use the response format as a sensing variable. Every structured intermediate output the agent produces is a way to test reasoning, not just output. The agent that produces "the right answer for the wrong reason" passes a naive eval and fails a sensing-variable eval. Build for the latter.

| Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| eval_inputs | array | yes | 20-50 | Representative input set spanning typical variety. Include edge cases. |
| judge | function (input, output) → score | yes | LLM judge with rubric | Small LLM call (Haiku-class) that scores outputs against a rubric. Must itself be characterization-tested. |
| tolerance | {passRate, n, runs} | yes | {0.95, 30, 3} | Pass rate band, sample size per run, number of repeated runs. Tighter for never-do; looser for always-do. |
| baseline_floor | recorded pass rate at last green run | yes | | The current pass rate. Floor for any future change. Improvements push it up. |
| refresh_cadence | days | no | 90 | Characterization tests are re-captured at this cadence. Stale captures are actively misleading. |
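The parameters above translate fairly directly into a typed config. The sketch below is one possible encoding, not a committed schema: the `EvalConfig` interface keeps the table's field names, while `tolerance_justification`, `validateConfig`, and `isStale` are additions made for the example (the first reflects the "written justification next to every band" idea from suggestion Two).

```typescript
// Hypothetical schema mirroring the parameter table; not a committed DOS format.
interface Tolerance {
  passRate: number;
  n: number;
  runs: number;
}

interface EvalConfig {
  eval_inputs: string[];
  judge: (input: string, output: string) => Promise<number>; // returns a score
  tolerance: Tolerance;
  baseline_floor: number; // pass rate recorded at the last green run
  refresh_cadence?: number; // days; defaults to 90
  tolerance_justification: string; // assumed field: why this band, in words
}

const DEFAULT_REFRESH_CADENCE_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns a list of problems; an empty list means the config looks sane.
function validateConfig(config: EvalConfig): string[] {
  const problems: string[] = [];
  const { eval_inputs, tolerance, baseline_floor } = config;
  if (eval_inputs.length < 20 || eval_inputs.length > 50) {
    problems.push(`eval_inputs has ${eval_inputs.length} entries; expected 20-50`);
  }
  if (!(tolerance.passRate > 0 && tolerance.passRate <= 1)) {
    problems.push("tolerance.passRate must be in (0, 1]");
  }
  if (!Number.isInteger(tolerance.n) || tolerance.n < 1) {
    problems.push("tolerance.n must be a positive integer");
  }
  if (!Number.isInteger(tolerance.runs) || tolerance.runs < 1) {
    problems.push("tolerance.runs must be a positive integer");
  }
  if (!(baseline_floor >= 0 && baseline_floor <= 1)) {
    problems.push("baseline_floor must be in [0, 1]");
  }
  if (config.tolerance_justification.trim() === "") {
    problems.push("tolerance_justification is empty");
  }
  return problems;
}

// A characterization capture is stale once it is older than the refresh cadence.
function isStale(capturedAt: Date, now: Date, config: EvalConfig): boolean {
  const cadenceDays = config.refresh_cadence ?? DEFAULT_REFRESH_CADENCE_DAYS;
  return (now.getTime() - capturedAt.getTime()) / MS_PER_DAY > cadenceDays;
}

const exampleConfig: EvalConfig = {
  eval_inputs: Array.from({ length: 30 }, (_, i) => `input ${i}`),
  judge: async () => 1, // stub standing in for the LLM judge
  tolerance: { passRate: 0.95, n: 30, runs: 3 },
  baseline_floor: 0.95,
  tolerance_justification: "Always-do behavior; one miss in 30 is tolerable variance.",
};

console.log(validateConfig(exampleConfig));
```

Keeping the justification in the same object as the numbers means a change to the band shows up in the same diff as the reason for it.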

The honest summary: I was skeptical for a long time that TDD-style discipline applied to agent behavior. The non-determinism felt like it broke the deal. After eighteen weeks of working through the translations by hand — tolerance bands, statistical pass/fail, characterization tests, sensing variables — I think the deal is preserved if you do the translation work.

The translation work is the part most teams skip. They either try to apply orthodox TDD (and get nowhere because of variance) or abandon TDD entirely (and ship prompts that regress silently every time anyone touches them). The middle path — TDD translated — is the one that should scale. I have not yet proved it does. By end of next quarter I want to be able to say I have run the formal eval suite for sixty days, that it caught at least one real regression, and that it produced fewer false positives than my hand-tracking did.

If those numbers hold, the bet pays. If they do not, I will write the retrospective with the specifics — which translation broke down, which tolerance band was wrong, which characterization went stale and lied to me. Either retrospective is more useful than continuing to hand-track an undocumented eval discipline for another quarter.

The eighteen weeks of doing this informally cost me three regressions caught only after they reached production. The next quarter is the one where I find out whether the translated TDD discipline actually closes that gap.

