Pular para o conteúdo
DuranteDurante
ALL SYSTEMSGet Access

27 semanas · 54 textos · Escritos durante a construção

Notas de campo de um SO de IA pessoal em voo

Toda terça-feira, um ensaio perene sobre o que aprendi enquanto envio o DuranteOS. Toda sexta, um boletim da semana. Cerca de 108 mil palavras e contando — para construtores que preferem ver a fundação ser lançada a ler o press release.

Assinar · Ensaio de terça

Cerca de 3.800 construtores leem isso toda semana.

The 90-Day Open-Weights Bakeoff: What Actually Routes Where in DOS Today

Studio has been routing across multiple substrates for ninety-five days. Seven providers integrated, ranging from Claude Sonnet 4.6 at one extreme to Google's open-weights Gemma 4 at the other. This essay is the empirical retrospective — which model handles which call type, where the routing logic sends each request by default, where the fallback kicks in, what the actual cost-per-quality numbers look like across roughly thirty thousand routed calls. Kent Beck on the smallest-experiment discipline applied to substrate selection. Andy Hunt and Dave Thomas on the Knowledge Portfolio reading: diversify, review, rebalance. Both applied to the moment routing decisions stop being intuition and become data.

I am writing this on a Tuesday in mid-April, thirty-one weeks and roughly two hundred and ninety-five commits into building DuranteOS. Studio has been live for ninety-five days. The Hexagonal Provider port has been operational since early February. The five-axis routing layer from W22 has been in production for three weeks. The seven adapters in the registry today are: Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.4, GPT-5.4 mini, Google Gemma 4 (the 31B Dense variant), Cursor Composer 2 (which is itself a fine-tune of Kimi K2.5), and the Kimi K2.5 base via OpenRouter.

Three months of routing across all seven is enough data to write the bakeoff retrospective on. The architecture I committed to in late January has produced roughly thirty thousand routed calls across that time, with cost, latency, and quality tracked per route. The numbers tell a story I had not fully predicted, which is exactly why writing the empirical retrospective at the ninety-day mark is the right discipline.

This is the post the W14 founding-customer essay was waiting for me to be able to write. Until I had real production data on which substrate routes where, the architecture pitch was theoretical. After three months, the pitch has receipts.

The retro in one sentence

The "best frontier model wins everything" framing dies on the first day of multi-vendor routing. By day ninety, the right substrate per call is determined by call-type, not by general capability — and the cost-per-quality differences across providers run six-to-twenty times for inner-loop work, exactly as the substrate-bifurcation thesis from W16 predicted.

The Kent Beck angle: empirical bakeoff as smallest experiment

Kent Beck's Empirical Software Design substack contributed the framing that made the bakeoff possible at all: every routing decision should be measurable; every measurement should be cheap; every revision should be evidence-driven, not aesthetic.

The discipline I committed to ninety days ago was simple. Each adapter records, per call: the call-type, the input-token count, the output-token count, the latency from request-issued to response-streamed, the cost in USD, and the operator's rating-on-completion (when the operator rates). Studio already had the cost and latency tracking from the credit-ledger work; the rating signal is sparse but real.

That is the smallest experiment that can produce signal. No formal eval suite at this scale yet — the W16 eval-suite essay describes the suite I am still in the middle of building. What I have is production-cost-and-rating data, which is a coarser signal than evals but accumulates faster. Beck's argument is exactly that — start with the cheapest measurement that produces signal; refine the measurement when the data starts disagreeing with itself.

The data has not disagreed with itself yet. The patterns are clean.

The Pragmatic angle: substrate selection as Knowledge Portfolio

Andy Hunt and Dave Thomas's Knowledge Portfolio framing from W19 applies almost directly to substrate selection. The four operating rules — invest regularly, diversify, manage risk, review and rebalance — map onto routing decisions one-for-one.

Portfolio ruleSubstrate routing analog
Invest regularlyAdd a new provider every 4-6 weeks; do not let the registry stagnate
DiversifyHold providers with different cost / capability / sovereignty-class profiles
Manage riskPair high-quality high-cost frontier providers with cheaper open-weights fallbacks
Review and rebalanceQuarterly retrospective on routing data; promote providers that earned it, demote ones that did not

The W19 essay's portfolio audit talked about learning investments. The same shape applies to substrate selection. The W22 routing essay encoded the architecture; this essay is the first formal review-and-rebalance cycle the architecture supports.

What the registry looks like today

Seven adapters, with the routing-policy defaults that emerged from ninety days of production data:

AdapterSovereignty classDefault routing forApproximate per-call costQuality score (1-5)
Claude Sonnet 4.6US-frontierDefault agentic-chat, all coding by defaultmedium4.6
Claude Opus 4.6US-frontierHigh-stakes architectural decisions, Council synthesishigh4.8
GPT-5.4US-frontierDrafting, summarization, polishingmedium4.4
GPT-5.4 miniUS-frontierInner-loop / cheap reasoning, eval-judgelow3.8
Gemma 4 (31B Dense)open-weights / US-originPrivacy-sensitive content, on-laptop fallbackvery low (local)3.6
Cursor Composer 2open-weights / Kimi K2.5 fine-tuneMultilingual coding, rare-language code reviewmedium-low4.2
Kimi K2.5 (base)open-weights / non-US-silicon pathFallback when sovereignty constraints exclude US-frontierlow4.0

The quality scores are operator ratings averaged across roughly thirty thousand calls — coarse but stable. The cost ranges are categorical, not point estimates, because the actual numbers vary too much per call type to compress into a single figure. The default routing column is the operative output of the bakeoff: it tells the routing layer where to send each call type unless tenant policy or capability requirements override.

What the data taught me

Three patterns emerged that I had not fully predicted in W17 or W22.

Three findings from 90 days of routing data

  1. GPT-5.4 mini is dramatically more capable per dollar than I expected. Six weeks ago the W17 dispatch covered the launch. The intuition was "it'll handle some inner-loop work, but anything serious goes to Sonnet 4.6 or Opus." Three months of data say something different — GPT-5.4 mini handles inner-loop reasoning, eval-judging, drafting, and summarization at quality scores within 0.6 points of Sonnet 4.6 on those specific call types, at roughly one-tenth the cost. The default-routing column above shifted toward GPT-5.4 mini for the cheap-call category as a result. I expected the open-weights tier to dominate this slot; the small US-frontier model got there first.
  2. Composer 2's multilingual edge is real and underused. The W21 dispatch noted that Composer 2 (Kimi K2.5 fine-tune) ranked strongly on SWE-bench Multilingual. Three months of routing data says the multilingual edge is a load-bearing differentiator for a specific call type — code review on non-English-comment code, polyglot codebase work, anything with mixed-language identifiers. Composer 2 outperforms Sonnet 4.6 on these by 0.3 quality points on a 5-scale. I have not yet integrated it for general coding because the SWE-bench gap is real, but for the multilingual sub-category it is now the default. The Knowledge Portfolio diversification rule paid back here in a specific way I did not predict.
  3. Gemma 4's privacy-fallback role is the most important thing it does. Gemma 4's quality score is the lowest of the seven adapters — 3.6 against the others' 3.8-4.8. That looks like a problem until you look at which calls it gets routed to. Tenant-policy rules send any call tagged no_training_on_data or eu_data_residency to Gemma 4 by default, because it can run locally on the operator's machine with no data leaving the laptop. The quality score is averaged over a category of calls where the alternative is no model at all — sovereignty-constrained customers cannot route to US-frontier providers without violating their constraints. Gemma 4's quality is "good enough to be the privacy-fallback," and the privacy-fallback slot is the highest-leverage one in Studio's customer mix today.

The third finding is the one I am most glad I waited for empirical evidence on. The architecture-essay from W22 declared sovereignty-class routing as a first-class concern. Without a privacy-fallback adapter that actually runs in the role, the architecture would have been theoretical. Gemma 4 made it operational.

What the data taught me about my prior assumptions

Three things I was meaningfully wrong about ninety days ago.

Day 1 routing assumptions vs. Day 90 reality

Day 1 assumption
  • "Frontier models do everything; open weights are a hedge for outages"
  • "Cost differences across providers are 2-3x at most for similar-quality work"
  • "The routing layer's main job is reliability — handle vendor outages gracefully"
  • "Per-call cost-tracking is overhead I will rationalize later"
  • "Sovereignty-class routing is theoretical until the first restricted customer signs"
Day 90 reality
  • Frontier models do some things; open-weights does others meaningfully better at lower cost
  • Cost differences across providers are 6-20x for inner-loop work; the routing has to capture the asymmetry
  • The routing layer's main job is cost-per-quality optimization; reliability is a side-effect
  • Per-call cost tracking is the data-source the bakeoff entirely depends on; would not have this essay without it
  • Sovereignty-class routing produces visible product behavior even before the first restricted customer signs (privacy-fallback to Gemma 4 covers a class of calls that would otherwise have no path)

The asymmetry-of-cost finding is the one with the most operational impact. A naive routing layer that always picks the highest-quality provider produces unit economics that do not scale. The same workload routed empirically across the bakeoff data produces roughly forty-three percent cost reduction at no measurable quality loss for the call types where mini-tier or open-weights adapters earned their default-routing slot. The bakeoff paid for itself thirty days into the ninety-day window.

Three pitfalls I avoided that would have cost me

Naming the failure modes I sidestepped is more useful than naming the wins. Three patterns I would warn another founder about.

The third discipline is the one that the W22 architecture essay committed to in advance. Thirty thousand calls of production data later, it has held. Adapter metadata is the right place to encode constraints. The routing function is supposed to be small.

What this implies if you are running a multi-vendor gateway

Three suggestions, each costly enough that I considered cutting them and was wrong each time.

One. Track per-call cost the day you start routing across providers. The data is the bakeoff. Without it, you have intuitions; with it, you have decisions you can defend. The cost of the tracking is small (one event per call, append-only); the benefit compounds with every adapter you add.

Two. Run the smallest-experiment integration discipline. New adapters land at low default-routing weight, accumulate two weeks of production rating data, then get promoted or demoted on evidence. The cost of the discipline is occasional under-utilization of strong adapters in their first two weeks; the benefit is that you do not over-promote on theoretical numbers and have to reverse later.

Three. Run the quarterly review-and-rebalance as a forcing function. The Pragmatic Knowledge Portfolio rules apply. Adapters that have not earned their slot get demoted. Adapters that have earned more get promoted. The discipline keeps the registry honest. The cost is one essay's worth of analysis per quarter (this essay being the first); the benefit is a registry that compounds rather than calcifies.

What I am NOT publishing

The exact rating numbers, per-call cost figures, and operator names from the ninety-day data set. The aggregate patterns are above; the individual data points belong to the operators who produced them. Studio's privacy posture is one of the things customers can evaluate, and the bakeoff data is part of that posture. The architecture-essay version of these numbers is what the public should see; the per-customer version is what the customer's audit log should hold.

The bakeoff is the discipline. The architecture is the affordance. The thirty thousand calls are the data. Each compounded into the others over ninety-five days, and the result is a routing layer that does not just function — it produces decisions I can defend with numbers, against which any future adapter has to earn its slot.

The next quarterly retro ships in mid-July, after another ninety days of data. By then DeepSeek V4 will be in the registry. Mythos's Glasswing-tier capability will or will not have opened access paths that change the bakeoff. The first founding customer will be running calls through Studio under their own tenant policy. The retro will either confirm the patterns from the first ninety days or revise them on evidence — and either result will be more useful than continuing to operate without a quarterly bakeoff.

I will write the next one in July, on a Tuesday, with the data.

Was this page helpful?

O arco de 27 semanas · Um corpo de trabalho

Vinte e sete semanas. Dois textos por semana. Seis meses de escrita durante a construção.

Semana

Ensaio de terça

Boletim de sexta