27 weeks · 54 posts · Written while building

Field notes from a personal AI OS in flight

Every Tuesday, an evergreen essay on what I'm learning while shipping DuranteOS. Every Friday, a dispatch from the week. Roughly 108,000 words and counting — for builders who'd rather watch the foundation get poured than read the press release.

Start here

If you're thinking about writing while building

After Twenty-Seven Weeks: What the Series Was For, and What I'm Building Next

If you're considering the founding cohort

The One Reference Customer Strategy: GTM for a Personal AI OS, Sketched Before the Customer Signs

If you want the architectural spine

The Decomposition Discipline I Am Trying to Codify Inside DOS

Subscribe · Tuesday essay

Around 3,800 builders read this weekly.

Jan 20, 2026

Failure Patterns, Sketched: The Bookkeeping Discipline I Want for the Agent's Mistakes

I have not built the failure pipeline yet. What I have, nineteen weeks in, is the smell — every Monday I correct the agent on the same kind of mistake I corrected the prior Monday, and the correction has no half-life past the current context window. This is the discipline I want to commit to before the next time I am tired enough to give up on it. Kent Beck on empirical software design and failure as data. Andy Hunt and Dave Thomas on Broken Windows and Boiled Frog. Both applied to an agent that is supposed to be getting better and currently is not.

I am writing this on a Tuesday in the third full week of January, nineteen weeks and roughly two hundred and thirty commits into building DuranteOS. Studio has been live for twelve days. The failure pipeline does not exist yet. The signal capture, the structural-fix workflow, the recurring-pattern threshold — none of those have shipped.

What I have is the smell that tells me I need to build it. Every Monday morning I correct the agent on a kind of mistake I corrected the prior Monday. The correction lives in the conversation log. The correction has no half-life past the current context window. The agent is not learning; I am just teaching the same lesson over and over, and the rate at which I am willing to keep teaching it is dropping.

Most agentic systems treat each session as fresh. That is the problem. The system that fails silently is the system that fails repeatedly, and the system that fails repeatedly is the system the operator stops using somewhere around the eighth or ninth time the same correction comes out of the operator's mouth.

I want to commit to the shape of the cure now, before I get too tired to be careful about it. The pipeline I am sketching is not glamorous — a SQLite table, a reflections journal, a small handful of hooks. The bet is that the unglamorous bookkeeping discipline is exactly the thing that makes an agent get steadily better instead of staying steadily the same.

This essay is the design before the code, with two practitioners channelled at the bottom for the framing.

What I want the failure loop to be, in one sentence

Every operator correction captured as a structured signal, classified by severity and pattern, persisted across sessions, and re-injected into the agent's constitutional rules when the same pattern fires four or more times in 30 days.

The Kent Beck angle: empirical software design

Kent Beck's Empirical Software Design substack frames design as a continuous experimental loop: observe what hurts, hypothesize a structural change that would make it hurt less, apply the change, observe whether it actually did. The loop is closed by measurement, not by aesthetic judgment.

The discipline I want for DOS's agent behavior is this loop applied to corrections. Each correction would be an observation that something hurt. The pattern classification would be the hypothesis (this hurt because the agent confused X with Y, or because the rule was missing, or because the prior fix did not generalize). The constitutional rule update would be the structural change. The next 30 days of signal-or-no-signal would be the measurement.

Beck's emphasis on small experiments is what shapes the threshold logic I am committing to. I do not want to act on a single correction (would over-fit to one operator's idiosyncrasy on one bad day). I do not want to wait for ten (the operator gives up before the system responds). My current best guess at the threshold is four-in-30-days. The number is informed by no data; it is a guess that I want to instrument and adjust. The discipline is not the specific number. The discipline is having a number at all, and changing it on evidence rather than on mood.

The Pragmatic Programmers angle: Broken Windows and Boiled Frog

Andy Hunt and Dave Thomas's The Pragmatic Programmer contributes two anti-patterns that both apply directly to this domain.

Broken Windows — the observation that small visible failures cascade. If the agent makes one wrong correction and nobody fixes the underlying pattern, the next wrong correction follows easily because the standard has dropped. The failure pipeline I want exists in part because Broken Windows in agent behavior are catastrophically expensive. A single uncorrected pattern propagates across every subsequent session before anyone notices, and by the time it is noticed the agent has reinforced the wrong behavior fifty times in fifty different contexts.

Boiled Frog — the observation that gradual degradation goes unnoticed. The agent's quality does not collapse all at once; it drifts. A 5.9/10 trailing-month rating that drops to 5.3/10 today does not feel like an emergency. It is. The pipeline's daily trend tracking exists to make Boiled Frog visible before the operator's frustration accumulates to the threshold of cancellation, which is the only signal you get if you are not measuring upstream.

Anti-pattern	Symptom	The countermeasure I want
Broken Windows	Small visible mistakes propagate	Capture every correction; structural fix at 4+ recurrences
Boiled Frog	Quality drifts down invisibly	Daily / weekly / monthly trend tracking with downward-trend alerts
Programming by Coincidence	Behavior works without explanation	REFLECT phase requires named cause for every fix
Knowledge Portfolio decay	Old patterns stop being remembered	Failure corpus permanent, queryable, never archived

The Pragmatic frame is what makes the bookkeeping discipline feel less optional. Hunt and Thomas's whole project is about the small disciplines that compound over years of practice. The failure corpus is one more line in that catalogue, applied to a domain (agent behavior) that did not exist when the original book was written but where the principles transfer cleanly.

What I want the corpus to actually contain

I do not have a corpus yet. I have nineteen weeks of imagined corpus entries — the corrections I have given the agent that I cannot quite remember the specifics of, in distribution that probably looks something like this if I ran the loop today:

Rendering diagram…

The "still-recurring" buckets are the ones that should be triggering the structural-fix workflow if the workflow existed. Three pattern families I can name from memory, before I have built any of this:

Three failure pattern families I have not yet structurally fixed

Concurrent-landing detection. The agent's pre-flight checks look at tracked files but sometimes miss untracked-but-staged files. I correct this maybe twice a month. Fix candidate I have in mind: an explicit git diff --staged --name-only --no-renames step before any pre-flight; not yet implemented because the integration point would be shared across hooks I have not refactored yet.
Skepticism calibration on user-asserted facts. The agent occasionally over-doubts when I state something with high confidence ("yes, that merge already landed"). The current rule is "always verify," which is too strong. The revision I keep half-drafting in my head: "verify when consequence is high; trust when operator names confidence." I have not committed it because I am not sure I trust myself to draw the line cleanly.
Process-stuck-vs-locked confusion. When a long-running process appears unresponsive, the agent sometimes retries instead of investigating, which compounds the actual issue. Fix candidate: a backoff-and-investigate sequence in the constitutional prompt. Not yet wired in.

Naming these in public is uncomfortable. It is also the point. The failure corpus is only useful if it is honest about what is currently failing, not just what has been resolved. I would rather flag the patterns I cannot yet fix than pretend the system is more mature than it is.

What I think capture should look like

Failures should get captured through three distinct surfaces. Each has a different signal-to-noise ratio.

Three surfaces should produce roughly 50-95 raw signals per week, of which about 30% become persistent corpus entries after dedup and noise filtering. The rest stay as ephemeral context for the current session and decay. Those numbers are guesses. I will know after running the loop for a quarter whether the volumes hold.

What I want "structural fix" to mean

The four-in-30-days threshold should trigger a structural-fix workflow. The workflow should be deliberately heavy because the alternative — patching the rule on the spot — has failed me more times than I can count.

The structural-fix workflow when a pattern crosses the threshold

Confirm the pattern. Read all 4+ instances. Is it actually the same root cause, or four separate failures that look similar? My guess is about 30% of suspected patterns turn out to be unrelated on inspection. Skipping this step is how Broken-Windows fixes break additional windows.
Name the root cause. Not the symptom. "Agent forgot to check X" is a symptom. "The constitutional rule about checking X is missing the case where X is conditionally required" is a root cause.
Draft the structural change. Usually a constitutional rule update; sometimes a hook addition; rarely a code change to the agent runtime.
Council-review the change. Recruit two or three specialists relevant to the failure family. For verification failures, recruit Beck and Feathers. Ensure the proposed fix does not create a new failure mode elsewhere. (The Council from Week 10 is also still vapor; this whole stack is sketches reinforcing each other.)
Apply the change. Single commit, with the failure corpus entries linked in the commit message.
Watch the signal for 30 days. If the same pattern reappears, the fix did not work; reopen.
Mark resolved-structurally. The corpus entry annotated with the commit SHA and the fix's rationale. The entry is NOT deleted — future audits need to see the history.

The workflow should take 2-4 hours per pattern. That is more expensive than a quick rule patch. It is much cheaper than continuing to be corrected on the same pattern weekly for a year.

What this would have caught already

I want to give one concrete example of the heuristic paying off, even though the heuristic does not exist yet.

I have, by hand, been tracking one pattern for the last four months: "manual diff -qr and cp invocations are error-prone and inconsistent." The pattern fired roughly eight separate times across four months. Each time I diligently updated the affected files manually, lost attention on a different issue, and forgot. By the eighth time I had spent about eight cumulative hours on what should have been one structural fix.

I eventually applied the structural-fix workflow informally: built sync-check.ts with .dos-sync-manifest.json — the four-copy mechanization I described in Week 5. The build cost was one weekend. The pattern has not fired in eleven weeks since.

That is the value of the heuristic, in concrete numbers. Eight separate hours of frustrating manual work over four months, replaced by one weekend of structural investment that has now run silently for eleven weeks. The total saved time is already 3-4x the build cost. The compounding will continue indefinitely. And it would have been triggered automatically — at the fourth recurrence, two months earlier than it actually was — if the failure pipeline had existed.

That is what I am building toward. Not the specific fix. The trigger that would have caught the pattern at recurrence four instead of recurrence eight.

What the heuristic costs

The four-in-30-days heuristic is biased toward over-investing in structural fixes. Some patterns that cross the threshold will turn out, on inspection, to not warrant structural work — the operator just had a bad week, or a single workflow generated four similar-looking but unrelated corrections. The cost of investigating these is small (2-4 hours wasted on a pattern that does not warrant fix). The cost of NOT investigating is large (the actual structural pattern compounds invisibly into a Boiled Frog). The asymmetry justifies the bias.

Name	Type	Required	Default	Description
signal_threshold	integer	yes	4	Number of recurrences in the rolling window before structural fix is triggered. Guess; will adjust on data.
window_days	integer	yes	30	Rolling window for recurrence counting. Longer windows produce more stable signals but slower response.
corpus_path	directory	yes	~/.claude/MEMORY/LEARNING/FAILURES/{YYYY-MM}/{slug}/	Per-failure directory. Each contains the original signal capture, the resolution, and the followup audit.
reflection_journal	jsonl	yes	~/.claude/MEMORY/LEARNING/REFLECTIONS/algorithm-reflections.jsonl	Append-only stream of all REFLECT outputs. Source of self-correction signals.
sync_target	endpoint	no	/api/v1/failures (Studio)	Optional cloud sync for cross-machine corpus access. Disabled by default.

The Algorithm

REFLECT phase will produce self-correction signals.

A worked example

The R7 signal that produced sync-check.ts.

Another worked example

The 7-loader refactor was triggered by recurring signal.

Council reviews fixes

Where structural-fix proposals get sanity-checked.

The honest version of "the agent gets better over time" is not magic. It is a bookkeeping discipline applied to mistakes, with a structural-fix trigger that is heavy enough to take seriously and a corpus that is queryable enough to learn from. None of this exists yet. The whole stack is sketches reinforcing each other — Sentinel sketched in Week 8, MemPalace sketched in Week 9, the Council sketched in Week 10, the failure pipeline sketched right now. They will land in some order over the next two quarters. The order matters less than the discipline of committing to the shape before exhaustion makes me settle for less.

The trend line right now is going down — today's rating, if I rated honestly, would be a 5.3/10 against a month-trailing 5.9/10. That is a Boiled Frog signal. I am taking it seriously this week. Two patterns are about to cross the four-recurrence threshold I have not yet implemented; both will get structural-fix workflows applied before next week, by hand, with the same shape the eventual pipeline should automate.

The system that does not name when it is failing is the system that fails silently. I would rather name it.

Was this page helpful?

The 27-week arc · A single body of work

Twenty-seven weeks. Two posts a week. Six months of writing while building.

Week

Tuesday evergreen

Friday dispatch

W01

2025-10-28

The Agent Is the Product: Why Intelligence Replaces Interface

2025-10-31

Wrapper Wars: The Week Cursor, Windsurf, and MiniMax All Went Vertical

W02

2025-11-04

Memory Replaces Lock-In: Designing the Substrate of Personal AI

2025-11-07

The Week the Substrate Caught Up: Kimi K2 Thinking Lands

W03

2025-11-11

Why I Just Left a Steady Practice to Build a Personal AI Operating System

2025-11-14

The Week the Harness Ate the Model

W04

2025-11-18

The Decomposition Discipline I Am Trying to Codify Inside DOS

2025-11-21

Antigravity, Gemini 3, Grok 4.1: The Week the Frontier Re-Rendered

W05

2025-11-25

Four Copies, One Source of Truth: The Sync Pattern I Want to Commit to Before It Hurts

2025-11-28

Opus 4.5 Lands on Thanksgiving Week: Anthropic's Quiet Counter-Punch

W06

2025-12-02

Plugin Architecture for Hooks: The Pattern I Want Before the Hook Becomes a God-Function

2025-12-05

The Runtime Is the Moat Now: Anthropic Buys Bun, Claude Code Hits a Billion

W07

2025-12-09

Credit-Metered API Gateways: The Pattern I Want Before I Have Anything to Meter

2025-12-12

Three Frontier Models in 23 Days: Stop Picking a Winner, Start Picking a Router

W08

2025-12-16

Sentinel, Sketched: Convention-Driven Onboarding Before I Build It

2025-12-19

The Spec Goes Public, the Substrate Goes to War: Skills Opens While Codex, Gemini Flash, and Nemotron Sprint

W09

2025-12-23

Memory, Sketched: The Knowledge Graph I Was Designing Before MemPalace Shipped

2025-12-26

Christmas Wasn't Quiet: Open Weights Caught Up, Nvidia Bought Inference, Frontier Labs Bought Loyalty

W10

2025-12-30

Council, Sketched: The Three-Round Debate Protocol I Want to Formalize Before I Build It

2026-01-02

The Practitioners' Year-End: While the West Took PTO, DeepSeek Shipped Architecture

W11

2026-01-06

Altyaa's Wedge: The Brazilian-Portuguese SMB Bet Studio Is Pointed At

2026-01-09

Default Claude: Microsoft Flips the Switch as the Substrate Distributes by Default

W12

2026-01-13

Builder's Compass: Two Years In, ~3,800 Subscribers, and What I've Learned Teaching Architecture

2026-01-16

The Agent Surface Splits in Two: Anthropic Goes Vertical While China Goes Substrate

W13

2026-01-20

Failure Patterns, Sketched: The Bookkeeping Discipline I Want for the Agent's Mistakes

2026-01-23

The Rules Layer Solidifies: Constitution, Hardware, and Export Controls Land in One Week

W14

2026-01-27

The One Reference Customer Strategy: GTM for a Personal AI OS, Sketched Before the Customer Signs

2026-01-30

Vertical Integration Eats Horizontal AI: Maia, Meta's Capex, and Anthropic-ServiceNow

W15

2026-02-03

Hexagonal in Practice: The Ports and Adapters I'm Pulling Studio Toward

2026-02-06

Coding Becomes the Flagship: Opus 4.6 and GPT-5.3-Codex Land the Same Day

W16

2026-02-10

TDD for AI Agents, Sketched: The Translation I Want to Commit to Before the Eval Suite Exists

2026-02-13

The Bifurcation Hardens: Anthropic's $380B Meets China's Open-Weights Frontier

W17

2026-02-17

Refactoring the Hook Pipeline: The Fowler Walkthrough I'm Three Refactorings Into

2026-02-20

The Sonnet Window: Free-Tier Skills, Code Security, and the Race for Agentic Primitives

W18

2026-02-24

Skills, Packs, and Hooks: The Three-Layer Model I'm Pulling DOS Toward

2026-02-27

The Moat Above the Model: Distillation, Mobile Coding, and MCP Battlegrounds

W19

2026-03-03

The Knowledge Portfolio for an Indie Founder Building in Public, Audited at Week 25

2026-03-06

Procurement Is the New Benchmark: OpenAI Wins the DoW While Anthropic Gets Banned

W20

2026-03-10

Six Months of DOS: What Changed, What Endured, What Surprised

2026-03-13

The Counter-Sovereignty Move: Anthropic Sues, Then Diversifies

W21

2026-03-17

MCP in Production: What Works When the Protocol Becomes the Boundary

2026-03-20

Cursor Becomes a Model Lab: Composer 2 and the Unbundling at the Agent Layer

W22

2026-03-24

Routing by Sovereignty Class: The Architecture That Survives the Procurement Decade

2026-03-27

The Injunction the Pentagon Won't Honor: Anthropic Wins, the State Defies

W23

2026-03-31

Sentinel's First Scan: What the Convention Catalog Found Across Eleven Projects

2026-04-03

The Agent Stack Swallows the IDE, the IPO, and the Courtroom

W24

2026-04-07

Forking MemPalace: The 48-Hour Integration Retro

2026-04-10

Project Glasswing: When the Frontier Bifurcates on Safety

W25

2026-04-14

The 90-Day Open-Weights Bakeoff: What Actually Routes Where in DOS Today

2026-04-17

Routines Eat the Workflow: The Harness Becomes the Product

W26

2026-04-21

The Substrate Portability Pact: What DOS Refuses to Couple to, Even as the Harness Consolidates

2026-04-24

The Substrate Signs a Ten-Year Lease: Gigawatts, Governance, and the Closing Window for Indie Posture

W27

2026-04-28

After Twenty-Seven Weeks: What the Series Was For, and What I'm Building Next

2026-05-01

The Week Exclusivity Died: Substrate Goes Plural, Rules Layer Becomes the Moat