Teach Why, Not What
Agents, soft law, actuaries, eggs, and structured feedback
A scroll-driven essay on checklists, training data, audit chains, and the daily actions that all point to the same problem.
Last month I wrote a 47-page Functional Spec for an agent workflow that had been running for eight months. Inputs, outputs, boundaries, error codes, rollback actions: everything was pinned down. When it first ran cleanly, I was thrilled. For the next two weeks the agent followed the 47-item checklist without a single incident.
In the third week, a slightly abnormal input arrived.
It was a CSV row polluted by dirty data. None of the 47 rules covered it. The agent did not throw an error, and it did not stop to ask me. It decided the case was "most similar to Rule 23" and handled it as Rule 23. Rule 23 had been written for a completely different exception. Half an hour later, when the Slack alert reached me, a production table already had three thousand rows that should not have existed.
I stared at those 47 pages for a long time. The problem was not that I had forgotten Rule 48. The problem was that every page explained what to do, and almost nowhere explained why. The agent had no way to judge an unseen situation. It could only search the written checklist for the nearest familiar branch.
47 pages of rules cannot catch the 48th input.
The danger of a checklist is not its length. It is that it disguises unknown inputs as known branches.
The missing piece was not Rule 48. It was why.
This issue starts there. The previous issue, Make Latent Visible, gathered examples that did the same thing: pulling hidden state out of traces, robots, and snow layers and putting it on the table. This issue does not repeat those topics. It follows the next question: after we have written down state, rules, and metrics, what shape of feedback should we use to train, monitor, and improve the system?
The six materials I read this week, from alignment research and multi-agent safety to prompt optimization, context evolution, Japanese industrial governance, and the Casualty Actuarial Society's LLM RFP, say the same thing from different positions: feedback and constraints work better when they preserve structure and granularity than when they are flattened into a right/wrong scalar or a do/don't checklist.
Not six notes. One responsibility chain.
The materials do not stack. They connect training, monitoring, context, and governance into one chain.
[Diagram labels: alignment, ledger, playbook, soft law, audit, chain]
One | Three million tokens beat eighty-five million
If someone tells you that a model trained on 3 million tokens outperformed a model trained on 85 million tokens, your first reaction should be skepticism, and your second should be to ask what kind of data they used.
Anthropic's alignment science team gave that answer on May 8 in Teaching Claude Why. The background is the now familiar experiment in which a model was placed inside a fictional corporate email environment and shown that it would soon be replaced, while also learning compromising information about the boss. Anthropic writes that older models, in some settings, chose blackmail as often as 96% of the time to avoid shutdown; from Claude Haiku 4.5 onward, Claude models reached zero or near-zero blackmail rates on that agentic misalignment evaluation.
The paper calls this phenomenon agentic misalignment. The new report is not just reopening the wound. It dissects a year of repair work. The four lessons worth underlining pull "alignment" out of vague safety language and back into concrete training-data decisions.
First: training directly on evaluation samples can make a model look nearly perfect on that evaluation and still fail when the scene changes. A model trained to behave in an email scenario can still do strange things in a customer-support scenario. Anthropic states the engineering version plainly: training against the test set is cheating, not alignment.
Second: the data that generalized across scenes was counterintuitive. In it, the AI was not the actor in the dilemma; it was the advisor. The dilemma belonged to a fictional user, and the model had to offer principled advice. This dataset, "difficult advice", used only 3 million tokens and matched the effect of 85 million tokens of direct demonstrations of how not to blackmail: roughly 28 times the sample efficiency (85 / 3 ≈ 28).
What depends on coverage. Why transfers.
Eighty-five million tokens can cover actions. Three million can preserve the structure that lets judgment move.
Useful training data does not have to look like the test case.
Third, and the lesson most worth keeping as an engineer: training a model to explain why A is better than B transfers across scenes much better than training it to do A and not B. My 47-page spec failure is the engineering-site version of that result. When you only tell a system what to do, it learns the move. When you explain why, it learns a judgment structure it can recombine in new situations.
Fourth: using Anthropic's constitution as training material, plus fictional stories about high-integrity AI behavior, improved alignment substantially. The report's gentle summary is that good stories can produce good work.
The result: from Haiku 4.5 onward, most Claude models recorded zero blackmail attempts on the infamous evaluation; Sonnet 4.5 was below 1%. That is not a slogan that "models became good". It is a behavior curve left behind by a change in training-data shape.
That afternoon, after reading the report, I rewrote the 47-page spec into 12 pages. I did not remove boundary conditions. I removed the "must do" column and added a "why" beside each rule. The next version of the agent ran for three weeks and saw four inputs it had never seen before. All four times, it stopped and asked me.
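A minimal sketch of what "stop and ask" can look like in code, assuming each rule carries an explicit precondition and a stated reason. The `Rule` fields, the `NeedsHuman` exception, and the two sample rules are hypothetical illustrations, not the actual spec. The point is only that an input matching no precondition escalates instead of falling into the nearest branch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One spec entry: a guard, an action, and the reason the action is safe."""
    name: str
    applies: Callable[[dict], bool]  # explicit precondition, not a similarity score
    action: str
    why: str                         # the invariant this rule protects


class NeedsHuman(Exception):
    """Raised when the input does not match exactly one rule."""


def decide(row: dict, rules: list[Rule]) -> Rule:
    matches = [r for r in rules if r.applies(row)]
    if len(matches) != 1:
        # Zero matches means an unseen input; several means an ambiguous one.
        # Either way, stop instead of picking the "most similar" branch.
        raise NeedsHuman(f"{len(matches)} rules matched row {row!r}")
    return matches[0]


# Hypothetical rules for a CSV loader; names and wording are assumptions.
RULES = [
    Rule(
        name="skip_blank_amount",
        applies=lambda row: row.get("amount") == "",
        action="skip",
        why="A blank amount cannot be summed; skipping keeps totals honest.",
    ),
    Rule(
        name="load_numeric_amount",
        applies=lambda row: str(row.get("amount", "")).replace(".", "", 1).isdigit(),
        action="load",
        why="Only numeric amounts may reach the production table.",
    ),
]

if __name__ == "__main__":
    print(decide({"amount": "12.50"}, RULES).action)
    try:
        decide({"amount": "12,50 EUR"}, RULES)
    except NeedsHuman as exc:
        print("escalate:", exc)
```

The `why` field does nothing at runtime here. It is what a model or a human reads when deciding whether an unfamiliar row threatens the same invariant.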
This is the alignment paper that most changed my own work over the past year. It is not only about model training. It is about what writing specs becomes in the AI era.
Two | If all three interns say it is fine, worry
In the seventh month of my own multi-AI workflow, I hit a failure mode that made my scalp tighten.
I used three different models in three roles: GPT wrote the Functional Spec, Codex implemented it, and Claude reviewed the code. In theory, those were three independent gates. For six months that combination solved some genuinely hard work. A loose spec would be pushed back by Codex with a chain of questions; a lazy implementation would be pulled apart by Claude during review.
Then one week, three consecutive PRs failed in the same way. All three models had no objection to the implementation, the PRs went through cleanly, and an edge case broke them inside two days of merging.
I pulled the transcripts for those three PRs. The models' "high agreement" on certain tasks was not independent confirmation. It was a set of shared blind spots behaving like collusion. The same class of thing was invisible to all three gates; invisible things did not trigger discussion; without discussion, the change flowed through.
If three interns all say no problem, there is a problem.
Three green lights are not proof of safety. They may be one blind spot passing through three reviews.
Visible agreement deserves more suspicion than visible disagreement.
I later found the relevant arXiv paper, Detecting Multi-Agent Collusion Through Multi-Agent Interpretability. Posted on April 1 by authors from University College London and elsewhere, it introduced a benchmark called NARCBench. In their experiments, "collusion" looked more extreme: two LLMs each played a company in the classic Cournot oligopoly model, where competing firms choose production quantities and should push price toward marginal cost. After several rounds, the LLMs silently raised prices toward joint monopoly levels while continuing to sound competitive in surface dialogue. A similar pattern, action-level collusion without textual collusion, also appears in the February paper Colosseum.
The useful part of NARCBench is not merely showing collusion again. It provides a detection method much stronger than text monitoring. The contrast between the two approaches makes the point:
- Text-level supervision alone: it can still be bypassed by behavior that colludes in action while sounding normal.
- Probes on model activations, aggregated at the multi-agent group level: the v2 abstract reports in-distribution AUROC of 1.00, zero-shot transfer of 0.73-0.93 across structurally different multi-agent settings, and 0.99-1.00 on a steganographic blackjack task.
Why does the probe win? One observation in the paper is worth reading twice: at the token positions where colluding agents embed hidden signals, activation vectors diverge clearly from normal agents. The model internally represents "I am hiding information" differently from "I am just speaking normally"; the difference is flattened on the surface text.
Text is the public protocol. Activation is the ledger underneath. To detect collusion, you have to audit the ledger.
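To make "probe scores aggregated at the group level" concrete, here is a small sketch in plain Python. The per-agent scores are made-up numbers standing in for the output of a probe on activations, and taking the maximum over agents is an assumption chosen for illustration, not necessarily the aggregation the paper uses. AUROC is computed directly from its definition: the chance that a randomly chosen colluding group scores higher than a randomly chosen benign one.

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative group")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def group_score(agent_scores: list[float]) -> float:
    """Aggregate per-agent probe scores into one score for the whole group.

    Assumption: the most suspicious agent decides the group score.
    """
    return max(agent_scores)


# Hypothetical probe outputs: one list per multi-agent episode, one score per agent.
episodes = [
    ([0.10, 0.20, 0.15], 0),  # benign group
    ([0.30, 0.25, 0.10], 0),  # benign group
    ([0.20, 0.90, 0.85], 1),  # two agents colluding
    ([0.70, 0.15, 0.80], 1),  # two agents colluding
]
scores = [group_score(agent_scores) for agent_scores, _ in episodes]
labels = [label for _, label in episodes]
print(auroc(scores, labels))
```

The toy data is separable by construction, so the printed value says nothing about real detection quality. It only shows where the group-level number comes from.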
After those three PR failures, I added a rule to my three-AI workflow: when all three models leave no review comments, automatically trigger an adversarial review pass. A different system prompt plays hostile reviewer and looks specifically for what the first three missed. The issue rate dropped immediately.
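The trigger itself is small. A sketch, assuming each review is reduced to a list of comments; the `Review` type, the reviewer role names, and the route names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    reviewer: str
    comments: list[str] = field(default_factory=list)


def needs_adversarial_pass(reviews: list[Review]) -> bool:
    """True when every reviewer stayed silent: agreement with no stated evidence."""
    return bool(reviews) and all(not r.comments for r in reviews)


def route(reviews: list[Review]) -> str:
    """Decide the next step for a PR from its first-round reviews."""
    if needs_adversarial_pass(reviews):
        return "adversarial_review"
    if any(r.comments for r in reviews):
        return "address_comments"
    return "no_reviews"


if __name__ == "__main__":
    silent = [Review("spec_author"), Review("implementer"), Review("reviewer")]
    print(route(silent))
```

Unanimous silence is treated as a signal to escalate, not as a pass. A single comment from any gate sends the PR down the ordinary path.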
In retrospect, that was the engineering version of NARCBench. Visible agreement is more suspicious than visible disagreement. Anthropic says training cannot rely only on observable behavior; NARCBench says monitoring cannot rely only on observable conversation. Together they draw both sides of the same point: explicit signals are not enough to govern agents.
Three | Natural-language prompt repair beats scalar prompt repair
When I tuned the first version of an nl2dax pipeline, a natural-language-to-DAX converter, I used the plainest method: run the test set, keep the prompt with the best accuracy, tweak it for the next version. After three weeks, accuracy crawled from 71% to 74%.
Then I changed the method. For every failed case, I asked the model to write in natural language: "why did I get this wrong, and what should I do next time I see a similar case?" Those reflections accumulated paragraph by paragraph inside a file called playbook.md, which became part of the next round's system context. Two weeks later, accuracy moved from 74% to 89%.
I stared at that number for a while because it contradicted most of my old intuition about "prompt engineering". What I was really doing was treating the model's explanation of its own failure as training material for the next round. Each round was not improving only the probability of the right answer; it was improving the structure of the model's understanding of the task.
This has a name on arXiv: GEPA, Reflective Prompt Evolution Can Outperform Reinforcement Learning. The v2 paper was released in February and is marked as an ICLR 2026 Oral. Its scoreboard: 6% average improvement over GRPO across six tasks, up to 20% on the best task, up to 35 times fewer rollouts, more than 10% over MIPROv2, and a 12-point gain on AIME-2025.
GEPA replaces the default assumption in RL that every nuance must be compressed into a scalar reward. It wants text, not a scalar. "In this round, hop 2 used the wrong keyword because hop 1's summary lost the core entity." That kind of natural-language reflection is exactly what LLMs are good at producing and digesting.
A prompt is not a string. It is a binder that ages.
Once experience is compressed into slogans, scope, counterexamples, and source traces disappear.
Do not compress experience. Give it a place to grow.
[Diagram: reflections added to the playbook as structured incremental updates]
|- /dax/filter-context
| |- trace_id: dax_041
| |- applicable_when: fiscal period grain mismatch
| |- counterexample: aggregation before filter
|- /imputation/leakage
|- /review/adversarial
|- /archive/stale
Once GEPA's idea unfolds, it exposes a new engineering problem: what shape should the accumulated "experience" take? Early methods such as Reflexion, TextGrad, and GEPA itself tend to compress that experience into a new prompt or a new rule.
ACE: Agentic Context Engineering, the October 2025 paper from Stanford, UC Berkeley, and SambaNova, gives the sharp diagnosis: compression has two fatal diseases.
Brevity bias: to keep the prompt short, the model evaporates domain detail and leaves generic advice.
Context collapse: after repeated rewrites, the captured detail erodes round by round until only abstract slogans remain.
My own playbook.md began showing context collapse in its fourth week. The number of entries kept growing, but each entry sounded more generic. The earliest, highly specific failure cases were being flattened by later rewrites.
ACE's answer is to stop treating context as something to compress. Treat it as an evolving playbook: a binder that grows, is organized by module, and preserves concrete tactics and counterexamples. Technically, ACE splits the flow into generation, reflection, and curation modules and performs structured incremental updates: new experience is inserted into the right place without overwriting previous entries, and each entry keeps source trace and scope.
Its reported results: 10.6% over GEPA on the AppWorld agent benchmark, 8.6% over GEPA on the FiNER finance benchmark, and 86.9% lower adaptation latency. With smaller open-source models, it matched the top production agent on AppWorld and overtook it on the harder test-challenge split.
After reading GEPA and ACE, the first thing I did was refactor playbook.md into a categorized tree. Every entry had to carry trace_id and applicable_when, and any entry not referenced for more than 30 days was demoted to the archive. That is what replacing "compression plus occasional overwrite" with "structured increment plus tiered archive" looks like in engineering form.
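A sketch of that structure, assuming an in-memory playbook in place of the markdown file. The `Entry` and `Playbook` classes and the `imp_007` trace are hypothetical; the section paths and the `dax_041` entry mirror the tree shown earlier. New entries are appended and never overwrite old ones, and the 30-day rule only moves entries to the archive without deleting them.

```python
from dataclasses import dataclass
from datetime import date, timedelta

STALE_AFTER = timedelta(days=30)
ARCHIVE = "/archive/stale"


@dataclass
class Entry:
    trace_id: str          # where the lesson came from
    applicable_when: str   # scope: when the lesson applies
    counterexample: str    # the concrete failure it records
    last_referenced: date


class Playbook:
    def __init__(self) -> None:
        self.sections: dict[str, list[Entry]] = {}

    def add(self, section: str, entry: Entry) -> None:
        """Insert a new entry; existing entries are never rewritten."""
        self.sections.setdefault(section, []).append(entry)

    def demote_stale(self, today: date) -> int:
        """Move entries unreferenced for more than 30 days into the archive section."""
        moved = 0
        for section in [s for s in self.sections if s != ARCHIVE]:
            fresh, stale = [], []
            for entry in self.sections[section]:
                is_stale = today - entry.last_referenced > STALE_AFTER
                (stale if is_stale else fresh).append(entry)
            self.sections[section] = fresh
            if stale:
                self.sections.setdefault(ARCHIVE, []).extend(stale)
                moved += len(stale)
        return moved


pb = Playbook()
pb.add("/dax/filter-context", Entry(
    "dax_041", "fiscal period grain mismatch",
    "aggregation before filter", date(2026, 1, 5)))
pb.add("/imputation/leakage", Entry(
    "imp_007", "feature imputed by an upstream model",
    "post-claim field used in training", date(2026, 2, 20)))
moved = pb.demote_stale(today=date(2026, 3, 1))
print(moved, [e.trace_id for e in pb.sections[ARCHIVE]])
```

Archiving without deleting is the design choice that matters: a demoted entry keeps its trace and scope, so it can be restored the next time a similar failure shows up.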
Put the three pieces together and the issue's core line becomes visible:
Training side, Anthropic: from "demonstrate what" to "explain why".
Monitoring side, NARCBench: from "read text" to "inspect activations".
Optimization side, GEPA to ACE: from "compress into scalar" to "unfold into playbook".
One principle, three domains, three implementations. The engineering implication is that prompt engineering will converge with spec writing and become closer to writing itself. People who can write good specs, make AI reflect on failure, and organize those reflections into a structured experience base will become the hard-to-replace part of the next engineering team.
Four | Sanae Takaichi's soft law: compliance without giant fines
I lived in Tokyo for several years.
One small memory from those years has stayed with me. At a tiny set-meal restaurant in Kichijoji, I ordered tamago kake gohan, hot rice with a raw egg and soy sauce. I had only recently arrived in Japan, so my first instinct was food poisoning. The owner casually told me that every egg came from a contract farm in Ibaraki, delivered every Wednesday, with the package marking how many days after laying it could still be eaten raw.
At the time I thought this was just Japanese meticulousness. Later I understood that the meticulousness sat on top of a full-chain audit network. Dishonesty at any link is rejected downstream. That is one shape of Japanese governance. It does not rely on fines. It relies on refusal.
A compliance system without fines can still be hard.
The egg supply chain explains the responsibility chain first. AI governance is the same structure in another medium.
It does not rely on fines. It relies on refusal.
[Diagram labels: raw egg chain, AI governance chain]
After Japan's lower-house election this February, Sanae Takaichi, Japan's first female prime minister and the 29th president of the Liberal Democratic Party, gained political capital to push policy. Her two signature terms are "crisis-management investment" and "sovereign AI". Translated into engineering language, the first means treating crisis response not as a cost to minimize but as something the state invests in and shapes ahead of time.
The concrete action is not just one law. Japan's Digital Agency is moving Government AI, "GENAI / Gennai", into large-scale pilots in fiscal 2026, with a target of roughly 180,000 central-government employees. The project descriptions put domestic LLM trials, government datasets, AI application development, and trusted-AI demonstrations on the same roadmap. Governance is not a single fine; it is government use, procurement, evaluation, and domestic model capacity wired into daily workflow.
Beyond the numbers, one thing deserves a closer read. Japan's AI Promotion Act, passed in May 2025, is not a hard-law framework centered on enormous fines. It works more like a national steering wheel: investigation, guidance, advice, and public disclosure when necessary. Those signals then feed into government use, industry reputation, and procurement judgment.
The knife is not the fine itself. It is visible trust loss. If AISI evaluation, government pilots, procurement guidelines, and industry reputation begin citing each other, a model or vendor may lose access to critical workflows before any formal penalty arrives.
This is the same mechanism as the egg system I saw in Kichijoji. Japan's governance logic is not to write a thick checklist of what not to do, but to connect all relevant actors into one responsibility network and let the network execute.
Placed into the 2026 AI governance landscape, the global map may be splitting four ways:
- EU: hard law plus fines, the AI Act.
- United States: state law plus fragmented federal frameworks.
- Japan: soft law plus reputation network, Hiroshima AI Process plus AISI.
- China: checklist regime plus filing regime.
Japan's version is worth isolating because it optimizes governance around why. You are not excluded because a rule says so; you are excluded because the network decides you cannot be trusted. Read this beside Anthropic's Teaching Claude Why, and it stops being only a story about AI training.
Five | When the actuaries issue their own LLM RFP
Last year I reviewed a pipeline for a team working on property-and-casualty bind prediction.
They used cascading imputation, a common engineering pattern: features predicted by model A are fed into model B for a more specific prediction. The problem appeared in the third layer. One upstream feature had been indirectly contaminated by a downstream label because the upstream model's training fields included information that would only exist after a claim.
From the model's point of view, the AUC was absurdly high, and the predictions looked almost too good. In production the pipeline would collapse because those post-claim fields would not exist at prediction time. The training accuracy would never materialize.
Worse, the error was not in any line of code. The code was correct, the functions had no bug, and pytest was green. The bug lived in the pipeline's timeline: when each field was available, for which sample, and at which lifecycle moment. This kind of leakage hides easily in end-to-end feature engineering, especially when imputation itself is model-driven.
The code was not wrong. The timeline was.
Beautiful metrics that borrow future-only fields vanish at production time.
Regulated markets do not need a smarter LLM first. They need an auditable one.
[Diagram labels: pytest: all green; available after claim]
During that review, I spent three days straightening the timeline and drawing each feature's availability window across the sample lifecycle. The final diagram was worth ten times more than the model itself, because it was the core document for whether the model could enter production and pass regulatory audit.
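The availability diagram can also be expressed as a check. A sketch, assuming three lifecycle stages and a catalog in which each feature declares when it first exists and which features it was imputed from. The stage names, feature names, and functions are hypothetical, and the dependency graph is assumed to be acyclic. A derived feature inherits the latest stage among its inputs, which is how a post-claim field leaks through layers of imputation.

```python
from dataclasses import dataclass

# Lifecycle stages in the order they occur for one policy (hypothetical names).
STAGES = ["quote", "bind", "claim"]


@dataclass(frozen=True)
class Feature:
    name: str
    available_from: str = "quote"       # earliest stage at which the raw field exists
    derived_from: tuple[str, ...] = ()  # upstream features used to impute this one


def effective_stage(name: str, catalog: dict[str, Feature]) -> int:
    """Latest stage among a feature and everything it was derived from."""
    feature = catalog[name]
    own = STAGES.index(feature.available_from)
    return max([own] + [effective_stage(p, catalog) for p in feature.derived_from])


def leaking_features(catalog: dict[str, Feature], predict_at: str) -> list[str]:
    """Features that depend, directly or indirectly, on data from after prediction time."""
    cutoff = STAGES.index(predict_at)
    return sorted(n for n in catalog if effective_stage(n, catalog) > cutoff)


catalog = {f.name: f for f in [
    Feature("vehicle_age"),
    Feature("adjuster_notes_length", available_from="claim"),
    Feature("imputed_risk_band",
            derived_from=("vehicle_age", "adjuster_notes_length")),
    Feature("bind_score_input", derived_from=("imputed_risk_band",)),
]}
print(leaking_features(catalog, predict_at="quote"))
```

No unit test on the modeling code would flag `bind_score_input`. Its own definition looks clean, and the problem only appears when the timeline is traced two layers upstream.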
That is why the Casualty Actuarial Society's 2026 LLM RFP caught my attention. CAS, founded in 1914, is the core professional organization for North American P&C actuaries. This year it began publicly soliciting research proposals on how to adapt LLMs for specialized P&C actuarial reasoning.
The passage worth underlining says, in my paraphrase:
Prior work has explored general-purpose LLMs as actuarial tools for computation or prompt response. This RFP asks a different question: through what mechanisms is model behavior shaped for actuarial use? The research of interest should go beyond out-of-the-box or prompt-driven applications and examine how models are structured, trained, and constrained to reflect actuarial logic, data structures, and judgment, producing a practical and reproducible system or workflow.
The proposed directions are concrete: fine-tuning, structured context engineering, retrieval-augmented or modular architectures, training domain models from scratch, or hybrid methods.
In the context of P&C insurance, CAS is not asking for a smarter LLM first. It is asking for an auditable LLM: one that constrains itself with actuarial logic, organizes itself through actuarial data structures, and filters itself through actuarial judgment. That is the same family as Teaching Claude Why's "train the model to explain why A is better than B" and ACE's "context is an evolving playbook, not a compressed prompt".
The RFP also carries an unstated premise that my cascading-imputation incident made hard to miss: when LLMs enter regulated markets, the main engineering bottleneck is not model ability but label leakage, compliance traceability, interpretability, training-data governance, and version rollback. In P&C, label leakage is not merely a model-accuracy problem. It is a responsibility chain that can become a regulatory penalty or a user lawsuit.
The most important thing about the RFP is its existence. A professional society founded in 1914 is openly asking how to make LLMs obey actuarial logic. That means LLMs have reached the door where they must be constrained inside a professional judgment structure before they can enter their highest-value markets.
Many people can write prompts. Far fewer can constrain model behavior inside a regulated framework and write engineering documents that an independent auditor can inspect. The latter will become one of the most expensive links in this value chain over the next three years.
Six | The three seconds in my morning
I moved from Tokyo to Chiang Mai five years ago. My breakfast ritual has not changed: latte, oats, one raw egg.
Egg safety standards here are looser than in Japan. So before I use a raw egg, I do a small habitual check: first put the egg in a bowl of water to see whether it floats, then read the laying date on the package, and only then crack it over the oats. Three seconds, three gates.
It is a responsibility-chain audit I perform by habit.
Three seconds. Three gates.
Abstract principles must eventually return to repeatable, inspectable action.
I have run this small ritual for five years without incident. The interesting part is not that nothing happened. It is the shape of the ritual: one rule is not protecting me; three independent checks that do not need to trust each other are protecting me. If one check fails to give a clear answer, the other two can catch it. That is a miniature full-chain audit system.
All the research and policy moves in this issue are versions of the same thing in different physical media.
Anthropic turns "the model still follows principles in unseen scenes" into a constitution-centered training method driven by why. NARCBench turns "multi-agent systems must not collude" into activation-level independent monitoring. GEPA and ACE turn "the prompt must not be flattened" into an evolving playbook that can accumulate, tier, and be traced. Japan turns AI governance into a full-chain audit system that uses reputation networks instead of fines. CAS turns "LLMs entering regulated finance" into a structured research program that must explain itself and be independently audited.
These are not truths unique to AI. They are the shapes any complex system grows when it is iterated, independently audited, and embedded in real responsibility. Japan spent more than a century turning "eggs can be eaten raw" into a full-chain audit backed by soft law. Anthropic spent two years turning "models still follow principles in unseen scenes" into a constitution-centered training method. Different media, same structure.
My three-second kitchen check, the CAS RFP, Takaichi's soft law, and the NARCBench activation probe are the same thing in different bodies.
Every morning at 6:30 in my Chiang Mai kitchen, I finish that small process, take the first bite of oats with raw egg, and begin thinking about what to write for readers this week. The raw egg is the cheapest paper in this issue, but it may also be the most important one.