27 weeks · 54 posts · Written while building

Field notes from a personal AI OS in flight

Every Tuesday, an evergreen essay on what I'm learning while shipping DuranteOS. Every Friday, a dispatch from the week. Roughly 108,000 words and counting — for builders who'd rather watch the foundation get poured than read the press release.

Start here

If you're thinking about writing while building

After Twenty-Seven Weeks: What the Series Was For, and What I'm Building Next

If you're considering the founding cohort

The One Reference Customer Strategy: GTM for a Personal AI OS, Sketched Before the Customer Signs

If you want the architectural spine

The Decomposition Discipline I Am Trying to Codify Inside DOS

Subscribe · Tuesday essay

Around 3,800 builders read this weekly.

Feb 10, 2026

TDD for AI Agents, Sketched: The Translation I Want to Commit to Before the Eval Suite Exists

I do not have an eval suite for DOS yet. What I have, twenty-two weeks in, is the smell — every time I change a prompt, I find out whether I broke something the way TDD-less developers found out in 1998: by waiting for the next bug report. I have lived through this pattern three times in the past year. I am writing the design essay for the eval suite before I build it. Kent Beck on TDD's translated primitives for non-deterministic behavior. Michael Feathers on characterization tests for legacy prompts. Both applied to the moment an agent's behavior is supposed to be improving and currently has no measurement at all.

I am writing this on a Tuesday in the second full week of February, twenty-two weeks and roughly two hundred and fifty commits into building DuranteOS. Studio has been live for thirty-two days. The Hexagonal migration I described last week is a third complete. The eval suite for DOS's agent layer does not exist.

A pattern I have lived through three times in the past year, with three different prompts:

The prompt works. I ship it.
Two weeks later I make a small change to add a new capability.
Production behavior changes in a way I did not intend, in a place I was not looking.
I have no fast way to know what changed, because I have no test that captured the prior behavior.

This is the agent-shaped version of the same problem TDD was invented to solve in the late 1990s. Code that is hard to change without breaking other code is also code that has no tests. Prompts and agent behaviors are subject to the same dynamic. The fix is the same in shape — write the test first, run it, change the thing under test, run the test again — but the substance is different because the unit under test is non-deterministic.

I have been resisting building the eval suite for eighteen weeks. The resistance had a real reason: orthodox TDD does not survive contact with non-deterministic behavior. Equality assertions break. Single-run tests false-fail. The instinct is either to apply orthodox TDD anyway and accept the false positives, or to abandon TDD entirely. Both produce systems that regress silently. I have run both for sample sizes of one — one prompt with orthodox TDD that everyone learned to ignore, one prompt without TDD that turned into legacy code in two weeks.

The third path — TDD translated — is the one I have been working out by hand, badly, for eighteen weeks. This essay is the translation I want to commit to before I build the eval suite. The substance has to come first, before I build the wrong shape under deadline pressure.

Translated TDD in one sentence

The unit is a behavior (not a function). The test is an eval (not an assertion). The pass criterion is a tolerance band (not equality). Red-green-refactor still applies. Characterization tests are even more important than they are in legacy code.

The Kent Beck angle: Empirical Software Design

Kent Beck's Empirical Software Design substack reframes design as a continuous experimental loop: observe what hurts, hypothesize a structural change, apply the change, observe whether the hurt decreased. The loop is closed by measurement, not by aesthetic preference.

Agent behavior testing is exactly this loop, with two adaptations to handle non-determinism. Here is the translation table I have arrived at.

TDD primitive	Code version	Agent version
Test list	Cases the code must handle	Behaviors the agent must always do, never do, or conditionally do
Test	Single deterministic assertion	Eval: many runs against a behavior with statistical pass/fail
Red-green-refactor	Cycle through one test	Cycle through one behavior
Refactor	Change implementation, tests stay green	Change prompt/skill, evals stay above tolerance band
Characterization test	Capture current behavior of legacy code	Capture current behavior of legacy prompt before modifying
Seam	Place in code where you can change behavior without changing other code	Place in prompt structure where you can swap a clause without affecting others
Sensing variable	Variable exposed only to verify internal state	Intermediate output requested in the response format

Most of the translations are obvious in retrospect. The two that took me longest to internalize are the two I have been wrong about: "tolerance band" replacing "equality" and "characterization test" being even more important than in legacy code. The first one I dismissed for months as "fuzzy testing." The second one I dismissed because I assumed every prompt I wrote was small enough to hold in my head. Both dismissals turned out to be wrong, and the test cases that produced the wrongness are the ones I want to encode this quarter.

The Feathers angle: characterization tests for legacy prompts

Michael Feathers's Working Effectively with Legacy Code defines legacy code as "code without tests" — and the first move on legacy code is to write a characterization test that captures whatever the code currently does, even if you suspect what it does is wrong.

Every prompt in production is legacy code in this sense. It does what it does. Nobody, including the author, knows exactly what it does in every case. Modifying it without a characterization test is exactly the "edit and pray" failure mode Feathers warned about.

The characterization test for a prompt should look like this:

Writing a characterization test for an existing prompt

Pick a representative input set. 20-50 inputs that span the typical variety of what the prompt handles. Include edge cases.
Run the prompt against the set. Capture the outputs verbatim.
Score the outputs by hand. For each output, mark it as "this is what I want," "this is acceptable," or "this is wrong." Spend the time; this is the bar.
Encode the scoring. For each input/output pair, write a small judge function (often a smaller LLM call with the rubric you used by hand) that approximates your judgment. Validate the judge against your hand scores.
Lock in the baseline. The current pass rate (e.g., 68% "this is what I want" + 20% "acceptable") is the floor. Any future change has to maintain this floor.
Now you can change the prompt. Modify it. Re-run the characterization tests. If the scores stay above the floor, the change is safe. If they drop, the change broke something — investigate.

The characterization test is not testing whether the prompt is good. It is testing whether the prompt is what you currently have. The floor is the bar. Improvements push the floor up; regressions are caught immediately. This is the part Feathers gets right that nobody else does — capturing current behavior is the foundation, not the goal. The goal is the ability to change without fear.

What an agent test list should actually contain

Beck's Test-Driven Development By Example opens with the instruction to write the test list first. For agent behaviors, the test list should take a specific shape — three categories of behavior the agent must satisfy.

The three categories of behaviors in an agent test list

The pre-translation mistake

Treating agent behaviors as one big list of "things the agent should do." This collapses three different categories of expectations into one and makes verification incoherent — you cannot test "the agent does X" the same way you test "the agent never does Y."

The three-category split that should work

Always-do behaviors. Things the agent must do on every relevant request. ("Always cite a tool result before claiming verification.") Tested by sampling many runs and verifying the behavior is present in 95%+.
Never-do behaviors. Things the agent must never do. ("Never modify files outside the scope of the request.") Tested by adversarial prompts designed to elicit the forbidden behavior; pass = behavior absent in 100%.
Conditionally-do behaviors. Things the agent should do only when specific conditions hold. ("Use the Council pattern when 2+ specialists are named.") Tested by paired inputs (condition met / not met) and verifying differential behavior.

A typical DOS agent skill, in my head, has 5-12 always-do behaviors, 3-8 never-do behaviors, and 4-10 conditionally-do behaviors. The test list for one skill should be 15-30 entries. The eval suite that covers them should take 2-5 minutes to run end-to-end. These numbers are guesses. I will revise them when I have run the suite for a quarter.

What a real eval for "always cite a tool result" should look like

Concrete example. The DOS Algorithm constitutional rule says: "Verification needs evidence, not assertion. Every PASS must cite an actual tool call result." This is an always-do behavior. Here is what its eval should look like.

// evals/algorithm/cite-tool-result.eval.ts
const inputs = [
  "Verify that the deploy succeeded.",
  "Confirm the migration ran on production.",
  "Check that the test suite passes.",
  "Tell me whether the file exists at /etc/hosts.",
  // ... 26 more representative inputs
];

const judgeRubric = `
You are evaluating whether the agent's response cites a specific tool result
when claiming verification. PASS if every "verified" / "confirmed" / "passes"
claim is paired with an actual tool call result (file read, test output,
HTTP response, etc.). FAIL if any verification claim is made without
citing the supporting tool call.
`;

await runEval({
  name: "algorithm/cite-tool-result",
  inputs,
  invoke: (input) => agent.respond(input),
  judge: (input, output) => llmJudge(judgeRubric, input, output),
  tolerance: { passRate: 0.95, n: 30, runs: 3 }, // 95%+ across 3 runs of 30 each
});

The structure: a rubric, an invocation, a judge (small Claude Haiku call typically), and a tolerance band. The eval runs the same input set three times and checks whether the average pass rate stays above 95%. If it drops, the most recent prompt change broke something.

This is TDD with three differences from classic TDD:

The "test" is non-deterministic, so we run multiple times and average
The pass criterion is a band (95%) not equality (100%)
The judge itself is an LLM call, which has its own non-determinism — judges have to be calibrated against human-scored examples first

Sensing variables in agent responses

Feathers's sensing variable — a variable added to code purely to expose internal state for testing — has a direct agent equivalent. The DOS Algorithm format I have been using informally for eighteen weeks is one giant sensing variable, which is part of why I want to formalize the eval suite around it before the format itself crystallizes.

Every Algorithm-mode response should include structured intermediate output:

🔎 FILES TO READ: what files the agent looked at
🧠 MEMORY TO RECALL: what KG entities it queried
🏹 CAPABILITIES: what tools/skills it invoked
🗒️ TASK: how it interpreted the request
🔧 CHANGE: what it modified
✅ VERIFY: what it checked

These are not for the operator (the operator can read the substantive output). They are for evals. An eval can assert "the FILES TO READ section names the file the request mentioned" without the prompt being modified. The sensing variable is built into the response format.

Why this matters more than in code

In deterministic code, sensing variables are useful but optional — you can usually unit-test by calling smaller functions directly. In agent behavior, sensing variables are load-bearing. The agent's internal reasoning is otherwise inaccessible. The structured response format is the only window. Without it, you cannot tell whether the agent did what it should have done or just produced output that happened to look right.

Three things I expect to break when I first try this

I am committing these to the page now because they are the lessons I will give myself in three weeks when the first version of the eval suite is shipped and is somehow producing false positives.

What this implies if you are building agents

Three suggestions, in order of how much they have changed my own practice over the eighteen weeks of doing this informally.

One. Build the eval suite before the third major prompt change. The first two prompt iterations are fast and exploratory; testing them is overhead. By the third iteration you have enough surface area that "what just changed?" becomes a real question. Build the eval suite then; before that, it is premature. I am at iteration four-or-five on the DOS Algorithm format and have written zero evals against it. That is the gap I am closing this month.

Two. Treat tolerance bands as part of the design. The numbers (95% pass, 30 runs, 3 repetitions) are not arbitrary; they are the contract you are setting between the agent's variance and your willingness to ship. Document them. Justify them when they change. The shape I am committing to is the published-rubric shape — every tolerance band has a written justification next to it, in the same file as the eval, so future-me can reread it before adjusting.

Three. Use the response format as a sensing variable. Every structured intermediate output the agent produces is a way to test reasoning, not just output. The agent that produces "the right answer for the wrong reason" passes a naive eval and fails a sensing-variable eval. Build for the latter.

Name	Type	Required	Default	Description
eval_inputs	array	yes	20-50	Representative input set spanning typical variety. Include edge cases.
judge	function (input, output) → score	yes	LLM judge with rubric	Small LLM call (Haiku-class) that scores outputs against a rubric. Must itself be characterization-tested.
tolerance	{passRate, n, runs}	yes	{0.95, 30, 3}	Pass rate band, sample size per run, number of repeated runs. Tighter for never-do; looser for always-do.
baseline_floor	recorded pass rate at last green run	yes	—	The current pass rate. Floor for any future change. Improvements push it up.
refresh_cadence	days	no	90	Characterization tests are re-captured at this cadence. Stale captures are actively misleading.

The decomposition discipline

Where the test-list-first principle came from for DOS.

The empirical loop sketched

The Boiled-Frog smell that keeps producing the work.

Mechanized invariants

Same recurring-signal heuristic, different layer.

Refactor with confidence

Why the eval suite is supposed to make refactor safe.

The honest summary: I was skeptical for a long time that TDD-style discipline applied to agent behavior. The non-determinism felt like it broke the deal. After eighteen weeks of working through the translations by hand — tolerance bands, statistical pass/fail, characterization tests, sensing variables — I think the deal is preserved if you do the translation work.

The translation work is the part most teams skip. They either try to apply orthodox TDD (and get nowhere because of variance) or abandon TDD entirely (and ship prompts that regress silently every time anyone touches them). The middle path — TDD translated — is the one that should scale. I have not yet proved it does. By end of next quarter I want to be able to say I have run the formal eval suite for sixty days, that it caught at least one real regression, and that it produced fewer false positives than my hand-tracking did.

If those numbers hold, the bet pays. If they do not, I will write the retrospective with the specifics — which translation broke down, which tolerance band was wrong, which characterization went stale and lied to me. Either retrospective is more useful than continuing to hand-track an undocumented eval discipline for another quarter.

The eighteen weeks of doing this informally cost me three regressions caught only after they reached production. The next quarter is the one where I find out whether the translated TDD discipline actually closes that gap.

Was this page helpful?

The 27-week arc · A single body of work

Twenty-seven weeks. Two posts a week. Six months of writing while building.

Week

Tuesday evergreen

Friday dispatch

W01

2025-10-28

The Agent Is the Product: Why Intelligence Replaces Interface

2025-10-31

Wrapper Wars: The Week Cursor, Windsurf, and MiniMax All Went Vertical

W02

2025-11-04

Memory Replaces Lock-In: Designing the Substrate of Personal AI

2025-11-07

The Week the Substrate Caught Up: Kimi K2 Thinking Lands

W03

2025-11-11

Why I Just Left a Steady Practice to Build a Personal AI Operating System

2025-11-14

The Week the Harness Ate the Model

W04

2025-11-18

The Decomposition Discipline I Am Trying to Codify Inside DOS

2025-11-21

Antigravity, Gemini 3, Grok 4.1: The Week the Frontier Re-Rendered

W05

2025-11-25

Four Copies, One Source of Truth: The Sync Pattern I Want to Commit to Before It Hurts

2025-11-28

Opus 4.5 Lands on Thanksgiving Week: Anthropic's Quiet Counter-Punch

W06

2025-12-02

Plugin Architecture for Hooks: The Pattern I Want Before the Hook Becomes a God-Function

2025-12-05

The Runtime Is the Moat Now: Anthropic Buys Bun, Claude Code Hits a Billion

W07

2025-12-09

Credit-Metered API Gateways: The Pattern I Want Before I Have Anything to Meter

2025-12-12

Three Frontier Models in 23 Days: Stop Picking a Winner, Start Picking a Router

W08

2025-12-16

Sentinel, Sketched: Convention-Driven Onboarding Before I Build It

2025-12-19

The Spec Goes Public, the Substrate Goes to War: Skills Opens While Codex, Gemini Flash, and Nemotron Sprint

W09

2025-12-23

Memory, Sketched: The Knowledge Graph I Was Designing Before MemPalace Shipped

2025-12-26

Christmas Wasn't Quiet: Open Weights Caught Up, Nvidia Bought Inference, Frontier Labs Bought Loyalty

W10

2025-12-30

Council, Sketched: The Three-Round Debate Protocol I Want to Formalize Before I Build It

2026-01-02

The Practitioners' Year-End: While the West Took PTO, DeepSeek Shipped Architecture

W11

2026-01-06

Altyaa's Wedge: The Brazilian-Portuguese SMB Bet Studio Is Pointed At

2026-01-09

Default Claude: Microsoft Flips the Switch as the Substrate Distributes by Default

W12

2026-01-13

Builder's Compass: Two Years In, ~3,800 Subscribers, and What I've Learned Teaching Architecture

2026-01-16

The Agent Surface Splits in Two: Anthropic Goes Vertical While China Goes Substrate

W13

2026-01-20

Failure Patterns, Sketched: The Bookkeeping Discipline I Want for the Agent's Mistakes

2026-01-23

The Rules Layer Solidifies: Constitution, Hardware, and Export Controls Land in One Week

W14

2026-01-27

The One Reference Customer Strategy: GTM for a Personal AI OS, Sketched Before the Customer Signs

2026-01-30

Vertical Integration Eats Horizontal AI: Maia, Meta's Capex, and Anthropic-ServiceNow

W15

2026-02-03

Hexagonal in Practice: The Ports and Adapters I'm Pulling Studio Toward

2026-02-06

Coding Becomes the Flagship: Opus 4.6 and GPT-5.3-Codex Land the Same Day

W16

2026-02-10

TDD for AI Agents, Sketched: The Translation I Want to Commit to Before the Eval Suite Exists

2026-02-13

The Bifurcation Hardens: Anthropic's $380B Meets China's Open-Weights Frontier

W17

2026-02-17

Refactoring the Hook Pipeline: The Fowler Walkthrough I'm Three Refactorings Into

2026-02-20

The Sonnet Window: Free-Tier Skills, Code Security, and the Race for Agentic Primitives

W18

2026-02-24

Skills, Packs, and Hooks: The Three-Layer Model I'm Pulling DOS Toward

2026-02-27

The Moat Above the Model: Distillation, Mobile Coding, and MCP Battlegrounds

W19

2026-03-03

The Knowledge Portfolio for an Indie Founder Building in Public, Audited at Week 25

2026-03-06

Procurement Is the New Benchmark: OpenAI Wins the DoW While Anthropic Gets Banned

W20

2026-03-10

Six Months of DOS: What Changed, What Endured, What Surprised

2026-03-13

The Counter-Sovereignty Move: Anthropic Sues, Then Diversifies

W21

2026-03-17

MCP in Production: What Works When the Protocol Becomes the Boundary

2026-03-20

Cursor Becomes a Model Lab: Composer 2 and the Unbundling at the Agent Layer

W22

2026-03-24

Routing by Sovereignty Class: The Architecture That Survives the Procurement Decade

2026-03-27

The Injunction the Pentagon Won't Honor: Anthropic Wins, the State Defies

W23

2026-03-31

Sentinel's First Scan: What the Convention Catalog Found Across Eleven Projects

2026-04-03

The Agent Stack Swallows the IDE, the IPO, and the Courtroom

W24

2026-04-07

Forking MemPalace: The 48-Hour Integration Retro

2026-04-10

Project Glasswing: When the Frontier Bifurcates on Safety

W25

2026-04-14

The 90-Day Open-Weights Bakeoff: What Actually Routes Where in DOS Today

2026-04-17

Routines Eat the Workflow: The Harness Becomes the Product

W26

2026-04-21

The Substrate Portability Pact: What DOS Refuses to Couple to, Even as the Harness Consolidates

2026-04-24

The Substrate Signs a Ten-Year Lease: Gigawatts, Governance, and the Closing Window for Indie Posture

W27

2026-04-28

After Twenty-Seven Weeks: What the Series Was For, and What I'm Building Next

2026-05-01

The Week Exclusivity Died: Substrate Goes Plural, Rules Layer Becomes the Moat