I am writing this on a Tuesday at the end of March, twenty-nine weeks and roughly two hundred and eighty-five commits into building DuranteOS. Studio has been live for eighty-one days. Sentinel — the codebase-convention scanner I sketched in W8 — shipped in production form last Friday, fourteen weeks after the design essay went up. This past weekend I ran it across all eleven projects in DOS's project registry. The retrospective on what it found is this essay.
The W8 essay committed the design before the code. The promise was: the scan should distill 5-15 conventions per project, write them to .sentinel/ and to the project's CLAUDE.md, and act as an information radiator the agent reads at every session start. The non-goals were equally specific: do not enforce, do not catalog every pattern, do not freeze the conventions.
Three days of weekend scanning later, I can say which parts of the design held, which parts surprised me, and which parts were wrong. Writing the retrospective in public — the same week the scanner is fresh enough that I can still remember what each scan felt like — is the discipline that keeps the retro honest. The version I would write in three months will be cleaner; the version I am writing today is more useful because it includes the ugliness.
The retro in one sentence
The shape of the design held. The scanner found 6-14 conventions per project (target was 5-15) at a confidence level that pattern-matched my own intuition about each codebase. The three biggest surprises were not in what the scanner found but in what it missed and why; one of them changed the production design.
The Sandi Metz angle: the squint test as scanner
Sandi Metz's squint test — the trick of half-closing your eyes and letting the shape of code reveal what the literal text obscures — is the closest analog I have for what Sentinel does mechanically. The scanner is not parsing semantics. It is detecting shape across many files, looking for what repeats often enough to be a convention.
The Four Rules from POODR — small classes, single-responsibility methods, fewer instance variables, name what you call — are the kinds of conventions a scanner can look for without understanding what the code does. The scanner reads the directory structure, the import patterns, the file-naming conventions, the test-file pairings, and looks for repeating patterns. What repeats in 80%+ of files becomes a candidate convention. What repeats in 30-80% becomes flagged drift. What repeats in under 30% gets ignored.
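The threshold cuts above can be sketched in a few lines. This is a minimal illustration, not Sentinel's actual code: the names `Bucket`, `prevalence`, and `classify`, and the sample file list, are assumptions made up for the example. It measures one shape (does each source file have a sibling test file?) and buckets the result.

```typescript
// Hypothetical names throughout; this is not Sentinel's API.
type Bucket = "convention" | "drift" | "ignored";

// Fraction of files (0..1) for which the shape check returns true.
function prevalence(files: string[], matches: (file: string) => boolean): number {
  if (files.length === 0) return 0;
  return files.filter(matches).length / files.length;
}

// The cuts from the essay: 80%+ convention, 30-80% drift, under 30% ignored.
function classify(p: number): Bucket {
  if (p >= 0.8) return "convention";
  if (p >= 0.3) return "drift";
  return "ignored";
}

// Made-up file list: four of the five source files have a paired test file.
const files = [
  "src/user.ts", "src/user.test.ts",
  "src/order.ts", "src/order.test.ts",
  "src/invoice.ts", "src/invoice.test.ts",
  "src/tenant.ts", "src/tenant.test.ts",
  "src/audit.ts",
];
const allFiles = new Set(files);
const sources = files.filter((f) => !f.endsWith(".test.ts"));
const hasTest = (f: string): boolean => allFiles.has(f.replace(/\.ts$/, ".test.ts"));

const paired = prevalence(sources, hasTest);
console.log(`test pairing: ${Math.round(paired * 100)}% -> ${classify(paired)}`);
```

Nothing here inspects what the code does; the check is purely about file names, which is the point of the squint test.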
Three days into the first weekend's scans, the threshold cuts I committed to in the W8 essay held within two-percentage-point variance per project. The 80% threshold produced a clean list of 6-14 conventions per project. The 30% threshold produced 1-6 drift flags per project, 2-4 for most of them: the kind of patterns that look like they want to be conventions but are not yet at canonical levels. The under-30% noise floor stayed quiet.
The lesson Metz's squint test teaches is the one I needed most: the scanner does not have to understand the code; it has to recognize the shape. If the shape repeats, it is a convention. If it does not, it is not yet. The mechanical discipline beats interpretive cleverness every time.
The Feathers angle: characterization tests for conventions
Michael Feathers's characterization test from Working Effectively with Legacy Code is the other shape Sentinel borrows. A characterization test captures what code currently does, even if you suspect what it does is wrong. The point is not to test correctness; the point is to capture the existing behavior so future changes can be scored against it.
Sentinel does this for conventions. The scan captures what the codebase currently looks like — the conventions that emerge from actual practice, not the conventions any document might claim. A CONTRIBUTING.md that says "use service-layer naming for backend modules" is irrelevant if the actual codebase uses six different naming patterns. The Sentinel scan reads the code, not the documentation. What the code currently does is the floor; future changes either match it or don't.
The framing helps me think about what the scanner should not do. It should not advise. It should not score against best practice. It should not refer to external rubrics. It should report: here is what your code currently does, in eight specific patterns, with their prevalence percentages. The agent then has a working theory of the codebase's actual conventions. Whether those conventions are good is a separate conversation — the kind that belongs in the Council pattern, not in a scanner.
What the scan actually found across eleven projects
The numbers from the weekend's run, projects ordered by code volume:
| Project | Conventions surfaced | Drift flags | Time to scan | Surprise (1-5) |
|---|---|---|---|---|
| Studio (Next.js + Prisma SaaS) | 14 | 4 | 67 sec | 2 |
| Altyaa (PT-BR SMB SaaS) | 12 | 3 | 51 sec | 1 |
| DOS itself (Claude Code config + skills) | 11 | 6 | 89 sec | 4 |
| dos-prisma-saas-kit (the kit fork) | 11 | 2 | 44 sec | 1 |
| Donne (CRM platform) | 10 | 3 | 38 sec | 2 |
| AdCore Turbo (Next.js + Prisma) | 9 | 2 | 32 sec | 1 |
| AxReady (multi-tenant ticketing) | 9 | 4 | 41 sec | 3 |
| Era Materna (pregnancy SaaS) | 8 | 3 | 29 sec | 2 |
| The Road to Next (course app) | 7 | 1 | 22 sec | 1 |
| Exordiom BDR | 7 | 2 | 25 sec | 2 |
| AxReady (sub-pack scan) | 6 | 2 | 18 sec | 3 |
Eleven projects. 104 conventions total. 32 drift flags. Average scan time 41 seconds. Total time across all eleven projects: just under eight minutes.
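For anyone who wants to check the arithmetic, the totals can be recomputed from the table. The rows below are copied from it in order; the variable names are just for this sketch.

```typescript
// Each row: [conventions surfaced, drift flags, scan time in seconds].
const rows: Array<[number, number, number]> = [
  [14, 4, 67], // Studio
  [12, 3, 51], // Altyaa
  [11, 6, 89], // DOS itself
  [11, 2, 44], // dos-prisma-saas-kit
  [10, 3, 38], // Donne
  [9, 2, 32],  // AdCore Turbo
  [9, 4, 41],  // AxReady
  [8, 3, 29],  // Era Materna
  [7, 1, 22],  // The Road to Next
  [7, 2, 25],  // Exordiom BDR
  [6, 2, 18],  // AxReady (sub-pack scan)
];

// Sum one column of the table.
const sumColumn = (col: 0 | 1 | 2): number =>
  rows.reduce((acc, row) => acc + row[col], 0);

const conventions = sumColumn(0);
const driftFlags = sumColumn(1);
const totalSeconds = sumColumn(2);
const averageSeconds = Math.round(totalSeconds / rows.length);
const totalMinutes = totalSeconds / 60;

console.log({ conventions, driftFlags, averageSeconds, totalMinutes });
```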
The "surprise" column is what I want to write about for the rest of this essay. I scored each scan on a 1-5 scale of how much the convention catalog surprised me — 1 being "this is exactly what I would have written by hand," 5 being "the scanner found patterns I did not know were there."
The three biggest surprises
Three things the first scan surfaced that I did not expect
- DOS itself scored a 4. I have written DOS for twenty-nine weeks. I would have predicted the scan would tell me nothing new about my own codebase. It told me three things: (a) my hook-loader files cluster around a four-method shape (`name`, `slot`, `phase`, `load`) that I never explicitly designed but that emerged after the refactoring sequence I described in W17; Sentinel codified an implicit interface that I had been treating as informal. (b) Skill files have a stable five-section structure I follow without thinking; the scanner named it. (c) My Council seat agents share a verbatim prologue I had been pasting between files; Sentinel flagged the duplication as a convention candidate. The discipline of the squint test surfaced patterns I was too close to see.
- AxReady scored a 3 because it had two competing conventions on the same axis. The multi-tenant ticketing app uses two different patterns for tenant scoping (one in `apps/web`, one in `apps/admin`). The scanner correctly flagged both as 50% drift, with neither at convention threshold. I had not noticed the split. The scan made it visible. This is exactly the Broken Windows signal from W13 (small inconsistencies that cascade if not surfaced), caught by the scanner before I tripped on it.
- The DOS scan also surfaced a pattern that turned out to be wrong. Sentinel surfaced "Bun-only execution; no Node fallback" as an 80%+ convention. That is true in my workflow. It is not true in the codebase: a contributor who clones the repo and runs `npm` instead of `bun` on the wrong file would hit non-Bun-compatible code paths. The convention is what I do, not what the code enforces. Sentinel cannot tell the difference yet: the scanner reads what is there, but cannot distinguish "what the operator does" from "what the code requires." The scan was technically correct and operationally misleading.
The third surprise is the one that changed Sentinel's production design. The scanner now distinguishes between operator conventions (patterns visible only when you watch the operator work) and code conventions (patterns visible to any contributor reading the code). The 80% threshold applies separately to each. The output catalog labels them. The change took half a day; the lesson cost more than that.
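One way to picture the split is a catalog entry that carries two prevalence figures and gets a label for each. The sketch below is an assumption about shape only: `PatternEvidence`, `labelPattern`, and the two prevalence numbers for the Bun example are invented for illustration, and how operator prevalence gets measured is left open.

```typescript
// Hypothetical sketch; names and numbers are illustrative, not Sentinel's API.
type Status = "convention" | "drift" | "ignored";

interface PatternEvidence {
  pattern: string;
  operatorPrevalence: number; // share of observed operator actions, 0..1
  codePrevalence: number;     // share of files where the code itself shows it, 0..1
}

interface LabeledPattern {
  pattern: string;
  operator: Status;
  code: Status;
}

// Same cuts as the rest of the essay, applied to one axis at a time.
function toStatus(p: number): Status {
  if (p >= 0.8) return "convention";
  if (p >= 0.3) return "drift";
  return "ignored";
}

// Each axis is scored against the thresholds independently.
function labelPattern(e: PatternEvidence): LabeledPattern {
  return {
    pattern: e.pattern,
    operator: toStatus(e.operatorPrevalence),
    code: toStatus(e.codePrevalence),
  };
}

// Invented figures: high in the operator's workflow, lower in the code itself.
const bunOnly = labelPattern({
  pattern: "Bun-only execution; no Node fallback",
  operatorPrevalence: 0.95,
  codePrevalence: 0.6,
});
console.log(`${bunOnly.pattern}: operator=${bunOnly.operator}, code=${bunOnly.code}`);
```

With figures like these, the same pattern reads as an operator convention and as code drift, which is the distinction the original scan could not make.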
What the scan got right that I expected
Five things the scan caught cleanly across most projects, in order of robustness.
One. Stack identification. Every project's stack — language, framework, ORM, build system, deployment target — was identified correctly. This is the table-stakes work; the scanner reads package.json, tsconfig.json, tailwind.config, and produces the one-paragraph stack summary the W8 essay specified. Eleven of eleven projects passed.
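A rough sketch of the package.json half of that step follows. The lookup table, the `detectStack` name, and the sample manifest (including its version strings) are assumptions for illustration; the real scanner also reads tsconfig.json and the Tailwind config, which this does not attempt.

```typescript
// Hypothetical sketch: map known dependency names to stack labels.
interface PackageJson {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

// Illustrative lookup table; a real one would be much longer.
const KNOWN_PACKAGES: Record<string, string> = {
  next: "Next.js",
  prisma: "Prisma",
  "@prisma/client": "Prisma",
  tailwindcss: "Tailwind CSS",
  typescript: "TypeScript",
};

function detectStack(pkg: PackageJson): string[] {
  const deps = { ...pkg.dependencies, ...pkg.devDependencies };
  const found = new Set<string>();
  for (const name of Object.keys(deps)) {
    // Own-property check so names like "constructor" cannot hit the prototype.
    if (Object.prototype.hasOwnProperty.call(KNOWN_PACKAGES, name)) {
      found.add(KNOWN_PACKAGES[name]);
    }
  }
  return Array.from(found).sort();
}

// Made-up manifest for the example.
const sampleManifest = `{
  "dependencies": { "next": "^14.0.0", "react": "^18.0.0", "@prisma/client": "^5.0.0" },
  "devDependencies": { "prisma": "^5.0.0", "typescript": "^5.0.0", "tailwindcss": "^3.0.0" }
}`;

const stack = detectStack(JSON.parse(sampleManifest) as PackageJson);
console.log(stack.join(", "));
```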
Two. Convention extraction at 80%+ prevalence. Where a pattern was genuinely canonical, the scan named it. Tenant-root entity in multi-tenancy projects. UUID primary keys. Service-layer naming. Hook-pipeline shapes. Skill-file structures. The scanner was not creative — it found what was there.
Three. Drift flagging at 30-80%. Genuine drift was correctly identified as drift, not as convention. The scan did not over-promote half-formed patterns into the canonical list. The threshold cuts held.
Four. Per-project Ubiquitous Language extraction. The Eric Evans frame from W8 — distilling the project's actual vocabulary — produced specific, project-flavored output: "Organization is the multi-tenant root" for Studio, "Reading session" for Era Materna, "Coach" for Altyaa. Each project's own vocabulary surfaced in its own catalog.
Five. Stale-content flagging. Each catalog is dated. The agent reads the date during session start and surfaces a banner if the scan is more than thirty days old. The mechanism worked on first ship — I do not know yet how often I will see the staleness banner because I just ran the scans, but the plumbing is there.
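The staleness check is simple enough to sketch. The function names, the banner wording, and the dates in the example are hypothetical; only the thirty-day cutoff comes from the design.

```typescript
// Hypothetical sketch of the thirty-day staleness check.
const STALE_AFTER_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days elapsed between the scan date and "now".
function ageInDays(scannedAt: Date, now: Date): number {
  return Math.floor((now.getTime() - scannedAt.getTime()) / MS_PER_DAY);
}

// Returns banner text if the catalog is more than thirty days old, else null.
function stalenessBanner(scannedAt: Date, now: Date = new Date()): string | null {
  const age = ageInDays(scannedAt, now);
  if (age <= STALE_AFTER_DAYS) return null;
  return `Sentinel catalog is ${age} days old; consider re-running the scan.`;
}

// Illustrative dates only.
const scanned = new Date("2025-01-01T00:00:00Z");
const today = new Date("2025-03-01T00:00:00Z");
console.log(stalenessBanner(scanned, today));
```

A catalog exactly thirty days old stays quiet under this reading of "more than thirty days"; the banner starts on day thirty-one.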
What the scan got wrong (besides the operator-vs-code-conventions issue)
Three real failures that I am noting publicly because pretending the scanner was perfect on first pass is dishonest.
What this implies if you are running scans across multiple projects
Three suggestions, each costly enough that I considered cutting them and was wrong each time.
One. Score your own scans on the surprise scale. The numbers tell you whether the scanner is working: 1 means the scanner is reproducing what you already know (low value); 4-5 means the scanner is finding patterns you did not see (high value). DOS at 4 was the highest-value scan of the weekend; AxReady at 3 came next; the rest at 1-2 were confirmations.
Two. Distinguish operator conventions from code conventions explicitly. The third surprise above cost me half a day of confused output. Do not let your scanner conflate "what the operator does" with "what the code enforces." They are different categories with different audiences.
Three. Run the scan, then read the catalog as if you have never seen the code. The squint-test discipline only works if you can put your own knowledge aside long enough to see what the scanner saw. If the catalog feels obvious, you are reading it through your own assumptions; if it feels like reading someone else's notes about your code, the scanner is doing its job.
What the scan does to the agent's behavior
In the next session in any of these eleven projects, the agent reads the project's CLAUDE.md and the .sentinel/ directory at session start. It now knows the conventions. In the first post-scan session I ran in Donne (a project I had not opened in three weeks), the agent's first message back to me referenced two conventions correctly without my prompting. That is the information-radiator effect from the W8 essay, finally visible in production.
The W8 sketch
The design essay this scan retrospective tests.
The three-layer model
Where Sentinel sits as a Pack.
MCP boundary
Sentinel ships as an MCP server in this same week.
The natural next step
Council-driven drift resolution.
The version of Sentinel that shipped last week is not the version I will be running at the end of Q2. The mtime-weighting fix lands this week. The cross-project aggregation lands the week after. The drift-resolution Council workflow lands somewhere in May. The retrospective on the post-Q2 version of Sentinel ships at the end of June, after another quarter of running scans across the same eleven projects.
What this first weekend already proved: the W8 design held. The thresholds held. The information-radiator promise held. The three failures I named above are correctable in days, not quarters. That is what shipping the design before the code buys you — when the code lands, the failures are known unknowns you can fix on a Tuesday, not unknown unknowns that surface on a customer call.
I would rather have the receipts. I have them now.