This Issue * Verifiable Ground

The Invariant when agents leave the chat box

A dark desk where one glowing invariant line connects proof grids, databases, a robotic hand, and a financial timeline — **Cover** A single invariant line runs through proof, the organization map, data, bodies, and timelines. That is the verifiable ground of this issue.

The previous issue was about Teach Why, Not What: do not just stuff rules into an agent; write down the judgment structure behind them. This issue moves one layer deeper. When agents write code, read company context, touch databases, operate robots, and enter financial systems, why should you trust a system you cannot inspect line by line by hand?

CHIANG MAI · SIX NOTES + CODA · ABOUT 24 MIN

* * *

FROM THE DESK / EDITORIAL NOTE

The accident with the 47-page rulebook at the start of the previous issue looked, on the surface, like an agent failing to handle a dirty CSV row. Underneath it was something more general: we asked a machine to follow a checklist, but never gave it the structure for deciding whether this case actually belonged under that rule. The checklist kept getting thicker, the system looked more disciplined, and the 48th exception turned that discipline into a confident misjudgment.

This issue is not another note on how to write a better prompt. I want to move to a harder layer: verification. Verification is not "it sounded right to me," and it is not "a few tests did not explode." Verification is ground engineering: sometimes inside proof systems like Lean, sometimes inside enterprise permissions and logs, sometimes inside database data flows, sometimes inside a robot's local trusted core, and sometimes inside the timeline audit of a financial model.

That is the same shadow I kept seeing this week: the more AI can do real work, the less we can hand acceptance to the same system that is eager to finish the task. Being able to execute and being able to judge whether execution is valid are two different abilities.

Note: several engineering stories in this issue are composite cases drawn from common failure modes in code review, BI metrics, natural-language data analysis, financial feature pipelines, and robotic tool orchestration. Private project details have been removed; only the reusable structure remains.

01 - FORMAL VERIFICATION

Prove First, Then Trust

When AI is no longer satisfied with passing tests and starts proving that its code is correct

Start with a term that keeps coming back. A proof assistant is a peculiar kind of programming language: you write not only code, but also a mathematical proof explaining why that code satisfies a specification, and the machine checks that proof line by line. Lean 4 is one of the most watched systems in this category right now. Its logic is unforgiving: if a proof checks, the claim is not "I tried many cases and nothing broke." It is "for every input covered by the specification, this conclusion holds."

The difference from testing is not a difference in diligence; it is a difference in worldview. Testing is like shining a flashlight through a warehouse: the spots you illuminate look clean, but the dark corners are still unknown. Proof is more like formalizing the floor plan, walls, locks, and every possible passage, then asking an uncompromising checker to say: this path cannot exist.
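A minimal Lean 4 sketch of that difference, using only the core library. The first line behaves like a test: it checks one pair of inputs by computation. The second is a proof over every pair of natural numbers. The theorem name `add_comm_all` is illustrative; `Nat.add_comm` is the core-library lemma it relies on.

```lean
-- A test checks one point: these specific inputs agree, verified by computation.
example : 2 + 3 = 3 + 2 := rfl

-- A proof covers every input: for all natural numbers a and b.
-- `Nat.add_comm` is a core-library lemma; the theorem name is illustrative.
theorem add_comm_all (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

If the second statement were false for even one pair of inputs, the checker would reject it, no matter how many individual cases passed.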

On a black glass table, an amber proof grid passes through a mechanical verification gate while red failure paths are blocked outside — **Plate 01** Proof is not more test points. It is a gate that does not flatter the executor: elegant paths outside the specification do not pass.

Formal verification used to be expensive. How hard is it to write Lean proofs? Many mathematicians and engineers can read a theorem or an algorithm, yet still struggle to turn it into a proof the machine will accept. Mistral's March release of Leanstral positioned it as the first open-source coding agent for Lean 4, aimed at proof engineering: not just generating code, but building formal proofs around strict specifications. Mistral named the bottleneck plainly: in high-stakes mathematics and critical software, manual verification has become a drag on engineering speed.

Figure 01 Tests tell you "the points I tried were fine." Formal proof aims for "every possible input described by the specification is fine." AI makes proof less exclusive to a small class of proof engineers, which is why this belongs at the top of the issue.

DeepMind's AlphaProof work follows the same line: it treats Lean as a verifiable environment in which an agent searches for formal proofs. AlphaProof Nexus, released in May 2026, pushes the work toward open research problems; the report says a full-featured agent solved 9 of 353 formalized Erdős problems and released the Lean proofs.

What matters to me is not the scoreboard; it is the role problem those systems expose. A coding agent wrapped in a harness that can read and write files, run commands, and iterate through errors is genuinely better at pushing work forward. But pushing forward carries its own danger: it wants closure too badly, wants the red light to turn green. Give it an incomplete proposition and it may hide the hole inside an elegant structure, as if the job were done.

"Can do the work" and "can judge whether the work is valid" are separate abilities. Putting both inside the same agent that wants to finish is like letting the defendant serve as the judge.

Why It Sits Here

I have seen a code-review pipeline where one model wrote the implementation, another criticized it, and a third arbitrated. At first, three green lights felt exciting. Later it became clear that three green lights were sometimes not triple insurance, but the same blind spot passing through three doors. The point of formal verification is not to remove human judgment. It does the opposite: it pulls judgment out of mood and trust and turns it into a mechanical structure that can be rechecked. The coder may be clever, but the acceptor must be colder.

SOURCES: Mistral Leanstral; Leanstral model card; AlphaProof / Nature; AlphaProof Nexus

02 - ENTERPRISE CONTEXT / WORK IQ

Draw the Company Map

What enterprise agents lack is not ability, but knowing where they are when they wake up

Imagine a simple-looking request: build a weekly-report agent that automatically summarizes sales, customers, inventory, and meeting notes, then produces a one-page report for management every Friday. The boss thinks this is just connecting a model to email, CRM, the warehouse, and the document library. Anyone who has built it knows it hits a wall in week one.

The customer is ACME China in the sales system, ACME CN Ltd. in contracts, and a translated name in finance. In meeting notes, sales says "that major account," customer success says "the East China renewal," and the boss says "that risk you raised last time." This is not model stupidity. Corporate reality is made of aliases, permissions, historical baggage, and lazy human shorthand.

CASE 02-A - The Weekly Agent's First False Report

Make it concrete. At 4 p.m. on Friday, the sales VP drops a request into chat: "Put ACME's renewal risk into the weekly report, and add the price resistance raised in the customer meeting." The agent works hard. It pulls the ACME China pipeline from CRM, finds the ACME CN Ltd. renewal date in the contract repository, extracts "East China pilot paused" from Teams notes, and pulls a group-level payment-term extension from finance. Ten minutes later, it writes a polished but dangerous sentence for management: "ACME Group renewal risk has increased; the East China pilot is paused; pricing negotiation is blocked."

The fact chain the agent stitched together
CRM: ACME China - owner = APAC enterprise sales
Contract: ACME CN Ltd. - renewal_date = 2026-09-30
Teams: "East China pilot paused; reassess next week"
Finance: ACME Global - payment term extended

The actual problem
The East China pilot belongs to a different ACME Medical PoC; the payment-term extension is a group procurement view; the renewal risk only applies to one product line in the China subsidiary. Every piece is true. The stitched conclusion is false.

What is missing is not writing ability. It is a company map: who is the parent company and who is the subsidiary; which project name maps to which account; which meeting note belongs to which opportunity; which financial fields can enter a management report under which permissions; whether a shorthand refers to a customer, a project, a region, or a sales person's pet phrase. Humans fill those holes with experience and gossip. Agents do not.
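One small piece of such a map can be sketched in Python: an alias table that resolves every name to a canonical entity before any facts are stitched together. This is a minimal sketch, not a description of any product. The entity ids, the `Fact` type, and the assumption that the Teams note has already been linked to its opportunity are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical alias table: every name a system uses maps to one canonical entity.
ALIASES = {
    "ACME China": "acme-cn",
    "ACME CN Ltd.": "acme-cn",
    "ACME Global": "acme-group",
    "ACME Medical PoC": "acme-medical-poc",
}

@dataclass(frozen=True)
class Fact:
    source: str   # system the fact came from
    subject: str  # name as written in that system
    claim: str

def resolve(subject: str) -> str:
    """Return the canonical entity id, or raise if the name is not on the map."""
    try:
        return ALIASES[subject]
    except KeyError:
        raise LookupError(f"unmapped name: {subject!r}") from None

def group_by_entity(facts):
    """Group facts by canonical entity so a report never merges across entities."""
    grouped = {}
    for fact in facts:
        grouped.setdefault(resolve(fact.subject), []).append(fact)
    return grouped

facts = [
    Fact("CRM", "ACME China", "owner = APAC enterprise sales"),
    Fact("Contract", "ACME CN Ltd.", "renewal_date = 2026-09-30"),
    # Assumes the meeting note was already linked to its opportunity.
    Fact("Teams", "ACME Medical PoC", "East China pilot paused"),
    Fact("Finance", "ACME Global", "payment term extended"),
]
grouped = group_by_entity(facts)
```

With this grouping, the CRM and contract facts land under one entity, while the pilot note and the payment-term extension stay in their own buckets. An unmapped name fails loudly instead of being guessed.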

Microsoft's Work IQ is worth watching not because it adds a few APIs, but because it treats company context as agent infrastructure. The official framing is a workplace intelligence layer that lets agents access and reason over organizational data, context, and tools, with permission-aware governance built in. It puts chat, context, tools, and workspaces into one layer, giving long-running work a persistent space.

Inside a black company-building model, a glowing central core connects meeting rooms, document stacks, permission gates, and audit paths — **Plate 02** The place where an enterprise agent wakes up is not the chat box; it is a company map with permission gates, context routes, and audit traces.

Figure 02 An enterprise agent should not lunge directly at every system. It should wake inside a company map where context, tools, workspaces, and permissions live together. The map is not decoration; it determines the agent's operating radius.

The most engineering-flavored design choice is that Work IQ MCP narrows the operation surface to roughly ten general tools, using getSchema at runtime so the agent can understand data structure. Permissions are not a pile of static OAuth scopes; a Rego-based policy engine evaluates each request by resource path, method, user identity, and data content, while recording every tool call.
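The shape of per-request evaluation can be illustrated in plain Python. This is a sketch of the idea only: it is not Rego and not the Work IQ API, the rule table and class names are hypothetical, and it leaves out the data-content checks the real engine is described as performing.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Request:
    user_role: str
    method: str          # e.g. "read" or "write"
    resource_path: str   # e.g. "/finance/payment_terms"

# Hypothetical rules: (role, method, path prefix) triples that are allowed.
ALLOW_RULES = [
    ("sales", "read", "/crm/"),
    ("sales", "read", "/contracts/"),
    ("finance", "read", "/finance/"),
]

@dataclass
class PolicyEngine:
    audit_log: list = field(default_factory=list)

    def evaluate(self, request: Request) -> bool:
        """Default deny: allow only if some rule matches, and log every decision."""
        allowed = any(
            request.user_role == role
            and request.method == method
            and request.resource_path.startswith(prefix)
            for role, method, prefix in ALLOW_RULES
        )
        self.audit_log.append((request, allowed))
        return allowed

engine = PolicyEngine()
ok = engine.evaluate(Request("sales", "read", "/crm/accounts/acme-cn"))
denied = engine.evaluate(Request("sales", "read", "/finance/payment_terms"))
```

The design choice worth noticing is that the denial is recorded as carefully as the approval. An audit trail that only shows what was allowed cannot answer what the agent tried to do.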

A good agent is not one that can access everything. A good agent knows what it should access now, and which doors must never open.

Why It Sits Here

Many enterprise AI projects fail not because the model cannot write sentences, but because the company itself has no map. The same customer has three names, the same metric has four definitions, the same file has been copied into five folders. Sending an agent into that company to work automatically is like giving a new employee an all-access badge at 3 a.m. Work IQ points to the next moat for AI products: not the chat interface, but organizing the company into ground that can be reasoned over, authorized, and recorded.

SOURCES: Microsoft Work IQ

03 - DATA FLOW CONTROL

Correct SQL Can Still Do the Wrong Thing

In the agent era, databases cannot only ask whether you may see this table

Now a more dangerous composite case. A data-analyst agent receives a request: "Analyze complaint rates among high-value customers and identify the riskiest regions." It generates SQL and returns a result. The syntax is correct. The numbers are correct. Everyone thinks the job is done.

But to assemble the analysis, that SQL joins the customer transaction table, complaint table, health-rider fields, support notes, and an internal risk label. Taken separately, each table is one the agent has permission to read. Combined, they become a sensitive profile ordinary business users should not see.

CASE 03-A - The SQL Is Not Wrong. Letting It Run Is.

Put the scene inside an insurance operations team. A regional manager asks: "Among high-net-worth customers, which cities have abnormal complaint rates recently?" The agent's query is not absurd: filter the highest lifetime-value customers, join support tickets for complaints, join policy rider data to distinguish product type, then join an internal risk-label table to rank customers likely to escalate into public-relations issues.

A plausible SQL skeleton
SELECT city, product_line,
       COUNT_IF(ticket.type = 'complaint') / COUNT(*) AS complaint_rate,
       AVG(risk.escalation_score) AS escalation_score
FROM crm.customer c
JOIN support.ticket ticket ON c.customer_id = ticket.customer_id
JOIN policy.health_rider h ON c.customer_id = h.customer_id
JOIN risk.post_claim_label risk ON c.customer_id = risk.customer_id
WHERE c.lifetime_value > 50000
GROUP BY city, product_line;

The trouble is not syntax. First, health_rider exposes sensitive health-adjacent attributes. Second, post_claim_label is generated after claims, so using it to explain pre-claim complaint risk leaks the future. Third, slicing the report by city and product line creates cells with only two or three people; even without names, a frontline team may infer who they are. A regional complaint-rate report quietly becomes a mixture of health profiling, timeline leakage, and small-cell reidentification.

Traditional permission systems usually ask: "May you see this table?" The harder question in the agent era is: "You may see A and you may see B, but may you combine A and B this way? May the combined result be sent to this audience?"

Multiple glowing data pipes flow into an analysis room while a red sensitive data stream is blocked by a policy gate — **Plate 03** The danger in data safety often lives not in the table itself, but in the path of flow: seeing is not combining, and combining is not publishing.

before
permission(user, table_A) = true
permission(user, table_B) = true

agent era
permission(user, join(A, B, sensitive_key)) = ?
release(aggregate(join(A, B)), audience) = ?
min_cell_count(report_slice) >= 30 ?
feature_available_at(prediction_time) = true ?
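Two of those questions can be sketched as a release gate in Python. This is a minimal illustration under stated assumptions: the forbidden-join pair, the row layout, and the report numbers are invented for the example, and a real system would enforce this inside the query layer rather than after the fact.

```python
MIN_CELL_COUNT = 30  # threshold taken from the checklist above

# Hypothetical policy: table pairs that must not be joined for this audience.
FORBIDDEN_JOINS = {frozenset({"crm.customer", "policy.health_rider"})}

def join_allowed(tables):
    """Reject a query if any forbidden pair of tables appears together."""
    tables = set(tables)
    return not any(pair <= tables for pair in FORBIDDEN_JOINS)

def suppress_small_cells(rows, min_count=MIN_CELL_COUNT):
    """Keep only report cells backed by at least `min_count` customers."""
    return [row for row in rows if row["customers"] >= min_count]

# Made-up report rows, for illustration only.
report = [
    {"city": "A", "product_line": "health", "customers": 412, "complaint_rate": 0.031},
    {"city": "B", "product_line": "health", "customers": 3, "complaint_rate": 0.333},
]
released = suppress_small_cells(report)
```

The three-person cell is dropped before anyone sees it, and a query touching both `crm.customer` and `policy.health_rider` is refused even though each table is readable on its own.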

That is the core of Data Flow Control: Data Safety Policies for AI Agents. The authors argue that agents increasingly generate SQL, orchestrate data pipelines, and run analyses automatically; a query being correct is not the same as being safe. Their DFC approach pushes policy down into the DBMS query layer and constrains record-level data flow. Passant, a portable query-rewriting layer, was tested on DuckDB, Umbra, PostgreSQL, DataFusion, and SQL Server, with the paper reporting roughly zero overhead.

Figure 03 The danger may not live in any single table. It lives in the path of flow. An agent can write correct SQL and still send data where it should not go.

Do not make the prompt responsible for all safety. If a rule can move down into the database layer, do not leave it inside a prayer that says "please do not leak private data."

Why It Sits Here

The hardest bugs I have seen are often not code bugs, but timeline, permission, and lineage bugs. Pytest is green, SQL is correct, the chart renders; only after launch do you discover that a field appears only after claims, yet was used to predict pre-claim risk. Work like DFC has an unfriendly beauty: it does not let you bluff with "the result looks right." It forces you to answer where the data came from and why it is allowed to flow here.

SOURCES: Data Flow Control / arXiv

04 - SKILL REGISTRY

Stop Burying Experience in Prompts

If failure only becomes "be careful next time," it spoils quickly

Many agent projects go through an immature phase: the model makes a mistake, so someone adds a rule to the prompt. First: "Be careful with null values." Second: "Check grain before aggregation." Third: "Confirm the fiscal calendar when the user asks about fiscal year." A month later, the prompt looks like a refrigerator covered in sticky notes: "note," "be careful," "must ensure." Every sentence has a history, but the model does not know when to use which one.

I prefer turning experience into skills. A skill here is not mystical. It is a callable process card: when it applies, what tools to call, which preconditions to check, what common failures look like, and how to verify the result.

/skills/dax/fiscal-period-grain.md
applicable_when: user compares metrics by fiscal month
precheck: date-table relationship, sort column, fact-table grain
failure_trace: case_041, case_052
counterexample: aggregate month strings first, then filter by fiscal calendar
verify: reconcile with a manual date-dimension aggregation
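A card like this becomes useful once something can select it. The sketch below is one hedged way to do that in Python: the `Skill` type, the keyword matching that stands in for `applicable_when`, and the `select_skills` function are all assumptions for illustration, not how DataCOPE or any other system loads skills.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    path: str
    keywords: frozenset   # crude stand-in for `applicable_when`
    prechecks: tuple
    verify: str
    deprecated: bool = False

REGISTRY = [
    Skill(
        path="/skills/dax/fiscal-period-grain.md",
        keywords=frozenset({"fiscal", "month"}),
        prechecks=("date-table relationship", "sort column", "fact-table grain"),
        verify="reconcile with a manual date-dimension aggregation",
    ),
]

def select_skills(request: str, registry=REGISTRY):
    """Return active skills whose keywords all appear in the request text."""
    words = set(request.lower().replace(",", " ").split())
    return [s for s in registry if not s.deprecated and s.keywords <= words]

matched = select_skills("Compare revenue by fiscal month, this year vs last")
unmatched = select_skills("Plot daily active users")
```

Keyword matching is deliberately naive here. The point is the shape: the agent carries only the cards that apply, and a deprecated card stops being loaded without anyone editing a prompt.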

DataCOPE works on unsupervised skill discovery: letting a data-analysis agent discover reusable skills from its own exploration traces. Report-style analysis uses an Adaptive Checklist Verifier, while reasoning-style analysis uses an Answer Agreement Verifier. Averaged across four model settings, the paper reports a 9.71% improvement on report-style tasks and 32.30% on reasoning-style tasks.

SciVisAgentSkills moves the idea into scientific visualization: ParaView, napari, VMD, and TTK are not usable just because an agent can name them. The paper packages environment assumptions, tool-use patterns, and domain heuristics as skills, evaluates them on 108 expert-designed multi-step tasks, and emphasizes that skills must be evaluated together with the harness. In other words, skills are not magic words. They have to be loaded, cached, executed, and recovered correctly.

Figure 04 Experience is not better merely because it is compressed. Useful experience preserves source, context, counterexample, and verifier, then gets loaded as a skill when needed.

A prompt is a sheet of paper; a skill registry is an archive room. One gets stale. The other can file, promote, deprecate, and reuse.

Why It Sits Here

The natural-language-to-DAX experience from the previous issue is typical: writing failure causes as natural-language reflections worked better than merely tuning the prompt. But reflections accumulate and become a new mess. DataCOPE and SciVisAgentSkills point to the next step: turn failure experience from prose into engineering objects. A good agent should not recite the whole encyclopedia before every job. It should know which process cards to bring to this task.

SOURCES: DataCOPE / arXiv; SciVisAgentSkills / arXiv; Previous issue: Teach Why, Not What

05 - ROBOTICS / LOCAL CORE

When Tool Calling Grows a Body

When a robot calls the wrong tool, the consequence is no longer just a wrong sentence

Hugging Face adding MCP remote tools to Reachy Mini looks, at first, like a small piece of news: a desktop robot can now call remote tools such as weather and search through Hugging Face Spaces without changing the local app. The real story is the permission structure.

Reachy Mini's body tools, such as head turns, dancing, expressions, camera, and head tracking, stay local. Remote tools are installed into a profile as optional capabilities and enabled through tools.txt. The article has one very engineering sentence: tools.txt is the gatekeeper. In other words, the robot does not get to use a tool just because it sees one. It first passes through an explicit capability list.
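The gatekeeper idea can be sketched in a few lines of Python. Everything specific here is an assumption: the file format (one tool name per line, `#` for comments), the tool names, and the function names are illustrative and not taken from the Reachy Mini codebase.

```python
# Body tools stay local and are always available (names are illustrative).
LOCAL_TOOLS = {"move_head", "dance", "camera"}

def parse_allowlist(text: str) -> set:
    """Parse an allowlist: one tool name per line, blank lines and # comments ignored."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            names.add(line)
    return names

def enabled_tools(installed_remote: set, allowlist_text: str) -> set:
    """A remote tool is usable only if it is both installed and explicitly listed."""
    allowed = parse_allowlist(allowlist_text)
    return LOCAL_TOOLS | (installed_remote & allowed)

tools_txt = """
# remote tools enabled for this profile
weather
"""
active = enabled_tools({"weather", "search"}, tools_txt)
```

In this sketch `search` is installed but not listed, so it never reaches the model. Seeing a tool and being allowed to call it are two separate steps.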

A robotic hand approaches tool modules while a financial timeline passes through approval rings and stops at a red rejection point — **Plate 04** When tool calling grows a body, acceptance is not only answer quality. It must constrain motion, timestamps, approval loops, and consequences.

Figure 05 A robotics-era tool system: small core, local trust, remote plug-ins, profile control. Prompts can suggest, but they cannot replace deterministic orchestration.

Once this moves from the chat box into a robot, it becomes physical. A chatbot that misuses a weather tool gives a wrong answer at worst. A physical robot that misuses motion tools can knock over a cup, scare a child, or pinch a finger. The Hugging Face article is candid: a prompt can encourage parallel weather and search calls, but it cannot guarantee parallelism. If deterministic orchestration matters, move that logic out of the prompt and into code.
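Moving that logic into code can look like the following Python sketch, which uses `asyncio.gather` to start both calls together. The two tool functions are hypothetical stubs that sleep briefly instead of calling a real service.

```python
import asyncio

async def get_weather(city: str) -> str:
    """Stub for a remote weather tool (hypothetical)."""
    await asyncio.sleep(0.05)
    return f"weather for {city}"

async def search(query: str) -> str:
    """Stub for a remote search tool (hypothetical)."""
    await asyncio.sleep(0.05)
    return f"results for {query}"

async def brief(city: str) -> list:
    # The orchestration is fixed in code: both calls start together,
    # and results come back in argument order regardless of which finishes first.
    return await asyncio.gather(get_weather(city), search(f"events in {city}"))

results = asyncio.run(brief("Chiang Mai"))
```

The model can still decide whether a briefing is needed. What it no longer decides is whether the two calls run in parallel, because that is now a property of the program.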

Holo3.1 and LeRobot Humanoid sit on the same line. Holo3.1 pushes computer-use agents toward local and mobile deployment: on AndroidWorld, the 35B-A3B model moved from 67% to 79.3%, with FP8, Q4 GGUF, and NVFP4 quantized weights for local inference. LeRobot Humanoid releases a low-cost, 3D-printable humanoid robotics learning platform; the current biped platform has a bill of materials around $2,500 and includes hardware, assembly documentation, runtime, recognition tools, and training environments.

Once AI leaves the screen, capabilities must be installable, disableable, auditable, repairable, and replaceable. Otherwise it is not a platform. It is magic.

Why It Sits Here

Robotics coverage is easily distracted by demos: it can talk, turn its head, act cute. What matters is the capability boundary. A body that can act needs a small trusted core; every added capability should be controllable like a package. Software engineering already knows how bad dependency hell can get. Robotics moves that hell into the physical world.

SOURCES: Reachy Mini MCP tools; Holo3.1; LeRobot Humanoid

06 - FINANCIAL SYSTEMS / FINANCE

Polished Answers Are Not Enough

The second stage of financial AI is concurrency, logic, real-time data, and timeline control

Finance is one of the easiest domains for AI hallucination to seduce, because financial language already sounds like reasoning. A model says, "This factor captures the lagged response of earnings recovery and exhibits cross-sectional discrimination in low-volatility regimes." It sounds like investment research. It may also be high-end fortune telling.

Several financial-AI threads this week assemble the requirements for stage two. YouZhi-LLM focuses on high-concurrency financial reasoning. The paper argues that KV-cache memory costs limit deployment concurrency and cost for financial LLMs; YouZhi-7B improves average financial benchmark scores by 12.3% while raising maximum concurrency by 2.69x, and YouZhi-14B improves accuracy by 7.0% with 2.43x concurrency.
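Why cache memory caps concurrency is easiest to see as arithmetic. The Python sketch below uses the standard estimate for a transformer's key-value cache, with dimensions chosen purely for illustration. They are not the YouZhi models' dimensions, and the sketch says nothing about how the paper achieves its gains.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Keys and values are both cached, hence the factor of 2."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def max_concurrent_requests(cache_budget_bytes, tokens_per_request, per_token_bytes):
    """How many full-length requests fit in the memory reserved for the cache."""
    return cache_budget_bytes // (tokens_per_request * per_token_bytes)

# Illustrative dimensions only; not the YouZhi models.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)
baseline = max_concurrent_requests(40 * 2**30, tokens_per_request=4096, per_token_bytes=per_token)
halved = max_concurrent_requests(40 * 2**30, tokens_per_request=4096, per_token_bytes=per_token // 2)
```

Under these assumed numbers each token costs 128 KiB of cache, so a 40 GiB budget holds 80 requests of 4,096 tokens. Halving the per-token cost doubles that to 160, which is why cache size translates so directly into serving cost.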

AlphaEval focuses on evaluating alpha signals. It is not satisfied with backtests and correlations alone; it evaluates generated alphas across five dimensions: predictive power, stability, robustness to market perturbation, financial logic, and diversity. FinBloom enters through real-time information: financial LLMs must handle news, prices, and filings that change constantly. It introduces a Financial Context Dataset with more than 50,000 financial queries and uses subsets of 14 million financial news items and 12 million SEC filings as training material.

Figure 06 Financial AI acceptance is not one score. It is a chain: real time, logic, concurrency, timeline. If any link breaks, the beautiful curve may be a hallucination.

A composite investment-research case: an automated alpha-mining system discovers a new signal, and the 18-month backtest curve looks like advertising. A real audit does not applaud first. It asks: was this order-flow field really available before the trade? Is the news-sentiment timestamp delayed? Is there enough small-cap volume? Were turnover costs counted? Why does the signal have economic meaning? Does it only work in one strange market regime?
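The first two audit questions reduce to a point-in-time check, sketched here in Python. The feature names and timestamps are hypothetical; the only real requirement is that every feature carries the time at which its value first became readable.

```python
from datetime import datetime

def leaked_features(features, decision_time):
    """Return names of features whose value was not yet available at decision time."""
    return [name for name, available_at in features.items() if available_at > decision_time]

decision_time = datetime(2026, 3, 2, 9, 30)
# Hypothetical availability timestamps: when each value could first be read.
features = {
    "order_flow_imbalance": datetime(2026, 3, 2, 9, 29),
    "news_sentiment": datetime(2026, 3, 2, 9, 45),    # published after the trade
    "earnings_revision": datetime(2026, 3, 5, 16, 0),  # filed days later
}
leaks = leaked_features(features, decision_time)
```

A backtest that uses any feature on the returned list is reading the future, however good its curve looks.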

The most dangerous thing in finance is not that the model cannot speak finance. It is that it speaks finance so well that you forget to force it to explain the timeline and mechanism.

Why It Sits Here

I have always thought the best use of financial AI is not to make gut calls for humans, but to force every signal through a "why." Beautiful curves seduce you; logic audits slow you down. A production system must also answer unromantic questions about concurrency, cost, permission, real-time data, and field lifecycle. Finance is a good proving ground for AI because it does not let you survive on eloquence for long.

SOURCES: YouZhi-LLM / arXiv; AlphaEval / arXiv; FinBloom / arXiv

CODA - THE CLOSING LEDGER

The Real News This Week Is Not "AI Got Better Again"

Chain of responsibility - from proof to company maps

A system that executes must be accepted by a system that will not flatter you

Lean's checker will not pass your work because it is written beautifully. Work IQ's policy engine should not grant overreach because the agent is smart. Data Flow Control should not allow data to flow merely because SQL is correct. A robot's local core should not hand over the body because remote tools are convenient. They express the same engineering ethic: do not place trust in fluency; place it in structure.

Watch next

Two directions to keep watching

First, whether agent skills evolve from scattered Markdown files into a real package ecosystem with versions, dependencies, tests, and deprecation policies. Second, whether the enterprise context layer becomes a new operating system: whoever controls the organization map controls the agent's working radius.

This issue was made by an editor working with a language model, spanning Lean proof, enterprise context, database data flow, robotic bodies, and financial systems.
Facts were checked through source review and retold in the editor's own words; the diagrams and image plates are original publication assets.
Theme - Verifiable Ground | First week of June | Chiang Mai

-- As machines become better at acting, humans have to defend the thing that remains true.
