27 semanas · 54 textos · Escritos durante a construção

Notas de campo de um SO de IA pessoal em voo

Toda terça-feira, um ensaio perene sobre o que aprendi enquanto envio o DuranteOS. Toda sexta, um boletim da semana. Cerca de 108 mil palavras e contando — para construtores que preferem ver a fundação ser lançada a ler o press release.

Comece por aqui

Se você está pensando em escrever enquanto constrói

After Twenty-Seven Weeks: What the Series Was For, and What I'm Building Next

Se você está considerando o grupo fundador

The One Reference Customer Strategy: GTM for a Personal AI OS, Sketched Before the Customer Signs

Se você quer a espinha arquitetônica

The Decomposition Discipline I Am Trying to Codify Inside DOS

Assinar · Ensaio de terça

Cerca de 3.800 construtores leem isso toda semana.

Apr 14, 2026

The 90-Day Open-Weights Bakeoff: What Actually Routes Where in DOS Today

Studio has been routing across multiple substrates for ninety-five days. Seven providers integrated, ranging from Claude Sonnet 4.6 at one extreme to Google's open-weights Gemma 4 at the other. This essay is the empirical retrospective — which model handles which call type, where the routing logic sends each request by default, where the fallback kicks in, what the actual cost-per-quality numbers look like across roughly thirty thousand routed calls. Kent Beck on the smallest-experiment discipline applied to substrate selection. Andy Hunt and Dave Thomas on the Knowledge Portfolio reading: diversify, review, rebalance. Both applied to the moment routing decisions stop being intuition and become data.

I am writing this on a Tuesday in mid-April, thirty-one weeks and roughly two hundred and ninety-five commits into building DuranteOS. Studio has been live for ninety-five days. The Hexagonal Provider port has been operational since early February. The five-axis routing layer from W22 has been in production for three weeks. The seven adapters in the registry today are: Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.4, GPT-5.4 mini, Google Gemma 4 (the 31B Dense variant), Cursor Composer 2 (which is itself a fine-tune of Kimi K2.5), and the Kimi K2.5 base via OpenRouter.

Three months of routing across all seven is enough data to write the bakeoff retrospective on. The architecture I committed to in late January has produced roughly thirty thousand routed calls across that time, with cost, latency, and quality tracked per route. The numbers tell a story I had not fully predicted, which is exactly why writing the empirical retrospective at the ninety-day mark is the right discipline.

This is the post the W14 founding-customer essay was waiting for me to be able to write. Until I had real production data on which substrate routes where, the architecture pitch was theoretical. After three months, the pitch has receipts.

The retro in one sentence

The "best frontier model wins everything" framing dies on the first day of multi-vendor routing. By day ninety, the right substrate per call is determined by call-type, not by general capability — and the cost-per-quality differences across providers run six-to-twenty times for inner-loop work, exactly as the substrate-bifurcation thesis from W16 predicted.

The Kent Beck angle: empirical bakeoff as smallest experiment

Kent Beck's Empirical Software Design substack contributed the framing that made the bakeoff possible at all: every routing decision should be measurable; every measurement should be cheap; every revision should be evidence-driven, not aesthetic.

The discipline I committed to ninety days ago was simple. Each adapter records, per call: the call-type, the input-token count, the output-token count, the latency from request-issued to response-streamed, the cost in USD, and the operator's rating-on-completion (when the operator rates). Studio already had the cost and latency tracking from the credit-ledger work; the rating signal is sparse but real.

That is the smallest experiment that can produce signal. No formal eval suite at this scale yet — the W16 eval-suite essay describes the suite I am still in the middle of building. What I have is production-cost-and-rating data, which is a coarser signal than evals but accumulates faster. Beck's argument is exactly that — start with the cheapest measurement that produces signal; refine the measurement when the data starts disagreeing with itself.

The data has not disagreed with itself yet. The patterns are clean.

The Pragmatic angle: substrate selection as Knowledge Portfolio

Andy Hunt and Dave Thomas's Knowledge Portfolio framing from W19 applies almost directly to substrate selection. The four operating rules — invest regularly, diversify, manage risk, review and rebalance — map onto routing decisions one-for-one.

Portfolio rule	Substrate routing analog
Invest regularly	Add a new provider every 4-6 weeks; do not let the registry stagnate
Diversify	Hold providers with different cost / capability / sovereignty-class profiles
Manage risk	Pair high-quality high-cost frontier providers with cheaper open-weights fallbacks
Review and rebalance	Quarterly retrospective on routing data; promote providers that earned it, demote ones that did not

The W19 essay's portfolio audit talked about learning investments. The same shape applies to substrate selection. The W22 routing essay encoded the architecture; this essay is the first formal review-and-rebalance cycle the architecture supports.

What the registry looks like today

Seven adapters, with the routing-policy defaults that emerged from ninety days of production data:

Adapter	Sovereignty class	Default routing for	Approximate per-call cost	Quality score (1-5)
Claude Sonnet 4.6	US-frontier	Default agentic-chat, all coding by default	medium	4.6
Claude Opus 4.6	US-frontier	High-stakes architectural decisions, Council synthesis	high	4.8
GPT-5.4	US-frontier	Drafting, summarization, polishing	medium	4.4
GPT-5.4 mini	US-frontier	Inner-loop / cheap reasoning, eval-judge	low	3.8
Gemma 4 (31B Dense)	open-weights / US-origin	Privacy-sensitive content, on-laptop fallback	very low (local)	3.6
Cursor Composer 2	open-weights / Kimi K2.5 fine-tune	Multilingual coding, rare-language code review	medium-low	4.2
Kimi K2.5 (base)	open-weights / non-US-silicon path	Fallback when sovereignty constraints exclude US-frontier	low	4.0

The quality scores are operator ratings averaged across roughly thirty thousand calls — coarse but stable. The cost ranges are categorical, not point estimates, because the actual numbers vary too much per call type to compress into a single figure. The default routing column is the operative output of the bakeoff: it tells the routing layer where to send each call type unless tenant policy or capability requirements override.

What the data taught me

Three patterns emerged that I had not fully predicted in W17 or W22.

Three findings from 90 days of routing data

GPT-5.4 mini is dramatically more capable per dollar than I expected. Six weeks ago the W17 dispatch covered the launch. The intuition was "it'll handle some inner-loop work, but anything serious goes to Sonnet 4.6 or Opus." Three months of data say something different — GPT-5.4 mini handles inner-loop reasoning, eval-judging, drafting, and summarization at quality scores within 0.6 points of Sonnet 4.6 on those specific call types, at roughly one-tenth the cost. The default-routing column above shifted toward GPT-5.4 mini for the cheap-call category as a result. I expected the open-weights tier to dominate this slot; the small US-frontier model got there first.
Composer 2's multilingual edge is real and underused. The W21 dispatch noted that Composer 2 (Kimi K2.5 fine-tune) ranked strongly on SWE-bench Multilingual. Three months of routing data says the multilingual edge is a load-bearing differentiator for a specific call type — code review on non-English-comment code, polyglot codebase work, anything with mixed-language identifiers. Composer 2 outperforms Sonnet 4.6 on these by 0.3 quality points on a 5-scale. I have not yet integrated it for general coding because the SWE-bench gap is real, but for the multilingual sub-category it is now the default. The Knowledge Portfolio diversification rule paid back here in a specific way I did not predict.
Gemma 4's privacy-fallback role is the most important thing it does. Gemma 4's quality score is the lowest of the seven adapters — 3.6 against the others' 3.8-4.8. That looks like a problem until you look at which calls it gets routed to. Tenant-policy rules send any call tagged no_training_on_data or eu_data_residency to Gemma 4 by default, because it can run locally on the operator's machine with no data leaving the laptop. The quality score is averaged over a category of calls where the alternative is no model at all — sovereignty-constrained customers cannot route to US-frontier providers without violating their constraints. Gemma 4's quality is "good enough to be the privacy-fallback," and the privacy-fallback slot is the highest-leverage one in Studio's customer mix today.

The third finding is the one I am most glad I waited for empirical evidence on. The architecture-essay from W22 declared sovereignty-class routing as a first-class concern. Without a privacy-fallback adapter that actually runs in the role, the architecture would have been theoretical. Gemma 4 made it operational.

What the data taught me about my prior assumptions

Three things I was meaningfully wrong about ninety days ago.

Day 1 routing assumptions vs. Day 90 reality

Day 1 assumption

"Frontier models do everything; open weights are a hedge for outages"
"Cost differences across providers are 2-3x at most for similar-quality work"
"The routing layer's main job is reliability — handle vendor outages gracefully"
"Per-call cost-tracking is overhead I will rationalize later"
"Sovereignty-class routing is theoretical until the first restricted customer signs"

Day 90 reality

Frontier models do some things; open-weights does others meaningfully better at lower cost
Cost differences across providers are 6-20x for inner-loop work; the routing has to capture the asymmetry
The routing layer's main job is cost-per-quality optimization; reliability is a side-effect
Per-call cost tracking is the data-source the bakeoff entirely depends on; would not have this essay without it
Sovereignty-class routing produces visible product behavior even before the first restricted customer signs (privacy-fallback to Gemma 4 covers a class of calls that would otherwise have no path)

The asymmetry-of-cost finding is the one with the most operational impact. A naive routing layer that always picks the highest-quality provider produces unit economics that do not scale. The same workload routed empirically across the bakeoff data produces roughly forty-three percent cost reduction at no measurable quality loss for the call types where mini-tier or open-weights adapters earned their default-routing slot. The bakeoff paid for itself thirty days into the ninety-day window.

Three pitfalls I avoided that would have cost me

Naming the failure modes I sidestepped is more useful than naming the wins. Three patterns I would warn another founder about.

The third discipline is the one that the W22 architecture essay committed to in advance. Thirty thousand calls of production data later, it has held. Adapter metadata is the right place to encode constraints. The routing function is supposed to be small.

What this implies if you are running a multi-vendor gateway

Three suggestions, each costly enough that I considered cutting them and was wrong each time.

One. Track per-call cost the day you start routing across providers. The data is the bakeoff. Without it, you have intuitions; with it, you have decisions you can defend. The cost of the tracking is small (one event per call, append-only); the benefit compounds with every adapter you add.

Two. Run the smallest-experiment integration discipline. New adapters land at low default-routing weight, accumulate two weeks of production rating data, then get promoted or demoted on evidence. The cost of the discipline is occasional under-utilization of strong adapters in their first two weeks; the benefit is that you do not over-promote on theoretical numbers and have to reverse later.

Three. Run the quarterly review-and-rebalance as a forcing function. The Pragmatic Knowledge Portfolio rules apply. Adapters that have not earned their slot get demoted. Adapters that have earned more get promoted. The discipline keeps the registry honest. The cost is one essay's worth of analysis per quarter (this essay being the first); the benefit is a registry that compounds rather than calcifies.

What I am NOT publishing

The exact rating numbers, per-call cost figures, and operator names from the ninety-day data set. The aggregate patterns are above; the individual data points belong to the operators who produced them. Studio's privacy posture is one of the things customers can evaluate, and the bakeoff data is part of that posture. The architecture-essay version of these numbers is what the public should see; the per-customer version is what the customer's audit log should hold.

The architecture

The five-axis routing layer this bakeoff exercises.

The hexagonal foundation

Where the Provider port lives.

The protocol layer

How the bakeoff integrates with MCP servers.

Studio's gateway pattern

Where the cost-tracking event log accumulates.

The bakeoff is the discipline. The architecture is the affordance. The thirty thousand calls are the data. Each compounded into the others over ninety-five days, and the result is a routing layer that does not just function — it produces decisions I can defend with numbers, against which any future adapter has to earn its slot.

The next quarterly retro ships in mid-July, after another ninety days of data. By then DeepSeek V4 will be in the registry. Mythos's Glasswing-tier capability will or will not have opened access paths that change the bakeoff. The first founding customer will be running calls through Studio under their own tenant policy. The retro will either confirm the patterns from the first ninety days or revise them on evidence — and either result will be more useful than continuing to operate without a quarterly bakeoff.

I will write the next one in July, on a Tuesday, with the data.

Was this page helpful?

O arco de 27 semanas · Um corpo de trabalho

Vinte e sete semanas. Dois textos por semana. Seis meses de escrita durante a construção.

Semana

Ensaio de terça

Boletim de sexta

W01

2025-10-28

The Agent Is the Product: Why Intelligence Replaces Interface

2025-10-31

Wrapper Wars: The Week Cursor, Windsurf, and MiniMax All Went Vertical

W02

2025-11-04

Memory Replaces Lock-In: Designing the Substrate of Personal AI

2025-11-07

The Week the Substrate Caught Up: Kimi K2 Thinking Lands

W03

2025-11-11

Why I Just Left a Steady Practice to Build a Personal AI Operating System

2025-11-14

The Week the Harness Ate the Model

W04

2025-11-18

The Decomposition Discipline I Am Trying to Codify Inside DOS

2025-11-21

Antigravity, Gemini 3, Grok 4.1: The Week the Frontier Re-Rendered

W05

2025-11-25

Four Copies, One Source of Truth: The Sync Pattern I Want to Commit to Before It Hurts

2025-11-28

Opus 4.5 Lands on Thanksgiving Week: Anthropic's Quiet Counter-Punch

W06

2025-12-02

Plugin Architecture for Hooks: The Pattern I Want Before the Hook Becomes a God-Function

2025-12-05

The Runtime Is the Moat Now: Anthropic Buys Bun, Claude Code Hits a Billion

W07

2025-12-09

Credit-Metered API Gateways: The Pattern I Want Before I Have Anything to Meter

2025-12-12

Three Frontier Models in 23 Days: Stop Picking a Winner, Start Picking a Router

W08

2025-12-16

Sentinel, Sketched: Convention-Driven Onboarding Before I Build It

2025-12-19

The Spec Goes Public, the Substrate Goes to War: Skills Opens While Codex, Gemini Flash, and Nemotron Sprint

W09

2025-12-23

Memory, Sketched: The Knowledge Graph I Was Designing Before MemPalace Shipped

2025-12-26

Christmas Wasn't Quiet: Open Weights Caught Up, Nvidia Bought Inference, Frontier Labs Bought Loyalty

W10

2025-12-30

Council, Sketched: The Three-Round Debate Protocol I Want to Formalize Before I Build It

2026-01-02

The Practitioners' Year-End: While the West Took PTO, DeepSeek Shipped Architecture

W11

2026-01-06

Altyaa's Wedge: The Brazilian-Portuguese SMB Bet Studio Is Pointed At

2026-01-09

Default Claude: Microsoft Flips the Switch as the Substrate Distributes by Default

W12

2026-01-13

Builder's Compass: Two Years In, ~3,800 Subscribers, and What I've Learned Teaching Architecture

2026-01-16

The Agent Surface Splits in Two: Anthropic Goes Vertical While China Goes Substrate

W13

2026-01-20

Failure Patterns, Sketched: The Bookkeeping Discipline I Want for the Agent's Mistakes

2026-01-23

The Rules Layer Solidifies: Constitution, Hardware, and Export Controls Land in One Week

W14

2026-01-27

The One Reference Customer Strategy: GTM for a Personal AI OS, Sketched Before the Customer Signs

2026-01-30

Vertical Integration Eats Horizontal AI: Maia, Meta's Capex, and Anthropic-ServiceNow

W15

2026-02-03

Hexagonal in Practice: The Ports and Adapters I'm Pulling Studio Toward

2026-02-06

Coding Becomes the Flagship: Opus 4.6 and GPT-5.3-Codex Land the Same Day

W16

2026-02-10

TDD for AI Agents, Sketched: The Translation I Want to Commit to Before the Eval Suite Exists

2026-02-13

The Bifurcation Hardens: Anthropic's $380B Meets China's Open-Weights Frontier

W17

2026-02-17

Refactoring the Hook Pipeline: The Fowler Walkthrough I'm Three Refactorings Into

2026-02-20

The Sonnet Window: Free-Tier Skills, Code Security, and the Race for Agentic Primitives

W18

2026-02-24

Skills, Packs, and Hooks: The Three-Layer Model I'm Pulling DOS Toward

2026-02-27

The Moat Above the Model: Distillation, Mobile Coding, and MCP Battlegrounds

W19

2026-03-03

The Knowledge Portfolio for an Indie Founder Building in Public, Audited at Week 25

2026-03-06

Procurement Is the New Benchmark: OpenAI Wins the DoW While Anthropic Gets Banned

W20

2026-03-10

Six Months of DOS: What Changed, What Endured, What Surprised

2026-03-13

The Counter-Sovereignty Move: Anthropic Sues, Then Diversifies

W21

2026-03-17

MCP in Production: What Works When the Protocol Becomes the Boundary

2026-03-20

Cursor Becomes a Model Lab: Composer 2 and the Unbundling at the Agent Layer

W22

2026-03-24

Routing by Sovereignty Class: The Architecture That Survives the Procurement Decade

2026-03-27

The Injunction the Pentagon Won't Honor: Anthropic Wins, the State Defies

W23

2026-03-31

Sentinel's First Scan: What the Convention Catalog Found Across Eleven Projects

2026-04-03

The Agent Stack Swallows the IDE, the IPO, and the Courtroom

W24

2026-04-07

Forking MemPalace: The 48-Hour Integration Retro

2026-04-10

Project Glasswing: When the Frontier Bifurcates on Safety

W25

2026-04-14

The 90-Day Open-Weights Bakeoff: What Actually Routes Where in DOS Today

2026-04-17

Routines Eat the Workflow: The Harness Becomes the Product

W26

2026-04-21

The Substrate Portability Pact: What DOS Refuses to Couple to, Even as the Harness Consolidates

2026-04-24

The Substrate Signs a Ten-Year Lease: Gigawatts, Governance, and the Closing Window for Indie Posture

W27

2026-04-28

After Twenty-Seven Weeks: What the Series Was For, and What I'm Building Next

2026-05-01

The Week Exclusivity Died: Substrate Goes Plural, Rules Layer Becomes the Moat