A moment of false confidence
I was running a demo on the hard tier of my synthetic benchmark. The question was a per-fund aggregation: "What is the total unrealized capital (sum of entry EV) per fund?" The agent ran. Out came a dict:
Fund I: 15510.0
Fund II: 21758.5
Fund III: 19940.5
Fund IV: 26899.0
Fund V: 18931.7
Fund VI: 18863.2
Fund VII: 19258.6
Fund VIII: 25629.6
Eight funds, sensible dollar amounts in the thousands, structurally well-formed. I noted the cost ($0.005), the latency (16 seconds), and moved on to the next demo question.
I had asked for unrealized capital. The agent returned total capital - every company per fund, realized and unrealized together. The correct values were ten-times smaller and one fund (Fund V) should have been zero because it had no unrealized investments at all. I didn't catch it. I had no ground truth in front of me, the numbers looked plausible, and the system gave them to me with no hesitation.
That moment is what the rest of this post is about. Building a multi-agent system with impressive metrics is the easy part - 100% on the hard tier, 97% on stratified eval, beating a strong full-context Gemini 2.5 Pro baseline by 18 percentage points. The interesting part is what the mistakes taught me about the gap between bench accuracy and production trustworthiness.
Why financial spreadsheet QA
Private-equity, venture, and infrastructure-investment analysts spend a significant chunk of every week pulling numbers out of portfolio workbooks. The workbooks are structured but messy: multi-line headers, fund-level subtotals, status flags, formulas that reference other sheets. The questions are simple to ask and tedious to answer.
"What's the median net debt to EBITDA across the realized portfolio in Fund VII?"
"Which three companies in Fund III had the highest entry EBITDA, and what's the total exit EV across those three?"
"How many unrealized investments are in each fund?"
These are the daily reality. They are also, surprisingly, hard for large language models. A model with a long context window can read the whole sheet but tends to lose precision on multi-step aggregations. Naive retrieval-augmented generation (RAG) collapses entirely because spreadsheets have no natural chunk boundary; the data lives in the cell relationships, not in the prose. Anyone who has tried to RAG a 200-row financial workbook has watched it go from "this seems fine" on simple lookups, to nonsense on anything that requires grouping or sorting.
The benchmark that frames this is FinSheet-Bench, published in March 2026 by a group at Qubera AG and the University of Zurich. Their key finding: even Gemini 2.5 Pro tops out around 75% on the hard tier. Naive RAG does worse.
I have a decade of financial-services background. The pain is concrete, the benchmark is honest, and that combination is rare enough to be worth building.
The architectural bet: transferring CoDaS
In April 2026, a team from Google Research and DeepMind published a paper called CoDaS (AI Co-Data-Scientist). The architecture pairs deterministic code runners with LLM interpreters: the LLM plans, generates code, and interprets results, but the actual computation happens in a runtime that the LLM cannot fake. The paper applied this to biomedical discovery (drug-target prediction, gene-expression analysis) and showed substantial gains over end-to-end LLM agents that produce answers directly.
The bet I made: this pattern should transfer to financial spreadsheet QA. Both domains share the same shape with structured data, multi-step reasoning, code execution, and factual verification. If the LLM is generating pandas code that runs against the actual workbook, you've removed a whole class of hallucination by construction. The numbers either come from the cells or the code crashes.
My contribution beyond a direct transfer is to serve the deterministic tools via the Model Context Protocol (MCP). The agents stay stateless. The spreadsheet is the only state. Tools — get_sheet_schema, get_range, execute_python, cite_cells — are exposed as MCP endpoints that any agent can call.
Why MCP? Three reasons:
- The same tool surface works for any frontend. Today the agents run inside an orchestrator I wrote; tomorrow a different system can hit the same MCP server. The tools don't change.
- Sandboxing is honest.
execute_pythonruns each call in a fresh Docker container with no network and no filesystem write. The MCP boundary is the security boundary. - It matches where the industry is heading. MCP went from emerging spec to broad adoption very fast. Building on it is building on what production AI systems will look like in the future.
The stack ended up as: Google ADK for the agent runtime, Gemini 2.5 Pro for reasoning, MCP for the tool surface, Docker for the sandbox, OpenTelemetry and Langfuse for tracing.
Building the bench (the boring part that matters)
FinSheet-Bench itself is commercially gated.
So I built a synthetic analogue. 24 Excel workbooks, 528 questions, 16 question templates across 7 categories (simple lookup, counting, sorting, filtering, aggregation, complex aggregation, list extraction). Three difficulty tiers. Two structural layouts (column-style, where each fund is a column header; row-separator-style, where funds are vertical sections separated by dividers and subtotal rows). All seeded for determinism.
The verifier is three tiers:
- Tier 1: exact match - for numerics within 2.5% tolerance, for strings via normalised comparison, for dicts via exact-key-set + per-value tolerance.
- Tier 2: fuzzy match - Jaccard for lists, key-recall-and-value-accuracy averaging for dicts. Threshold 0.85.
- Tier 3: LLM judge - for the long tail where the answer is correct but worded differently from ground truth.
Each result records which tier resolved it. A healthy agent puts most answers in Tier 1; a fragile one leans on Tier 2 to scrape through.
Two things make this bench worth the effort:
- The synthetic bench is a controllable, deterministic, freely re-runnable dataset that other researchers can use to ablate parts of an architecture without dealing with proprietary access.
- When you build the dataset and the verifier, every "100% on Tier 1" claim is one you can verify by reading the JSONL of every question and the code that produced every answer. There's nowhere to hide.
The baselines (where the bar actually is)
Before building any agents, I ran two baselines.
Full-context using Gemini 2.5 Pro. Load the workbook as markdown, paste it into the prompt with the question, let the model think. Result: 94.3% overall, 81.8% on the hard tier. Total cost for the 528-question run: $4.06. This is a strong baseline. Gemini 2.5 Pro is genuinely capable on tabular data when the whole table fits in context, which on these workbooks it does.
Naive RAG. Chunk the workbook by row windows (with column headers carried into every chunk), embed with Google's text-embedding-005, retrieve top-5, ask the model to answer from the retrieved chunks. Result: 69.1% overall, 45.5% on the hard tier. This was the expected result as RAG doesn't work well on tabular data. Aggregations collapse because the model only sees a sample of rows. The 69% is recovered by simple lookups on small files where one chunk happens to contain the answer.
Two takeaways from the baselines:
- Naive RAG is a strawman for spreadsheet QA. It exists in the project to prove what fails, not what works. If you're building a tabular-data RAG system, the chunking strategy is the entire problem; row-windowing with header-carry is the most generous version of "naive" you can run, and it still misses most aggregations.
- Focus on hard-tier number for architecture comparison. The overall accuracy is too easily skewed for this synthetic bench by the easy questions. The hard tier is where multi-step reasoning, per-fund aggregations, and ratio computations live; exactly where structured architectures should win.
I will consider expanding the dataset in future but for now, I locked in those numbers, archived the reports, and moved to building the agent stack.
The agent stack
The architecture has three agents plus a Fact Sheet (a structured intermediate record borrowed directly from CoDaS).
Schema Agent. Deterministic, not an LLM. Reads the workbook, infers the layout (column-style vs. row-separator-style), identifies fund boundaries, collapses multi-line headers into single column names, returns a SchemaCard. This is unsexy code. It is also where most of the project's brittleness lived during early development — header detection, average-row identification, fund-boundary inference, the cases where a workbook's first row looks like a title rather than a header. Solved unglamorously, in tests.
Decomposition Agent. Takes a natural-language question and the SchemaCard, returns a QueryPlan: ordered subgoals, each with an operation type (filter, aggregate, count, sort, lookup), a description, an expected answer type. Uses Gemini 2.5 Pro with response_schema=QueryPlan (Pydantic) for controlled generation. One LLM call, structured output, deterministic shape.
Computation Agent. Takes the QueryPlan and the SchemaCard. Generates pandas code. Executes it in the Docker sandbox via the MCP execute_python tool. Returns the value. This agent is where the original architectural mistake lived.
Fact Sheet. A structured record of what got computed at each step, what code ran, what value came back, whether there were retries. It exists for three reasons: audit (you can trace any answer back to its code), debugging (when the answer is wrong, you can see why), and downstream verification (the Verifier reads it).
There's also a small but important piece called the prelude - a deterministic Python block injected before every LLM-generated code execution. It injects a Fund column for row-separator-layout workbooks and removes average and divider rows. The prelude was added to make the LLM's job easier: by the time the LLM-generated code runs, df is guaranteed clean. The LLM doesn't have to remember to filter out subtotal rows; it just sees a tidy DataFrame of portfolio companies.
Architecturally, the flow is:
Question
→ SchemaAgent (deterministic)
→ DecompositionAgent (Gemini, structured output)
→ ComputationAgent (Gemini + Docker sandbox via MCP)
→ final answer
That's M2.3. Wired up, tested, plugged into the existing bench runner via a thin AgentSolver adapter that implements the same protocol as the baseline solvers. Same JSONL output, same verifier, same report pipeline. Everything is apples-to-apples comparable to the M1.3 and M1.4 baselines.
The first time I ran it end-to-end on the hard tier, the score was 77.3%. Below the full-context baseline of 81.8%. That was an interesting result.
The mistake: D18
When the partial eval came back at 77.3%, my first instinct was to look at which questions failed. The bench format makes this easy: every result is a JSONL record with the question, the ground truth, the agent's answer, the verifier's verdict, and the verifier's explanation.
Five failures on the hard tier. Three of them looked similar:
- Q4 ("List all companies in the newest fund"): the agent returned a long list of companies, but spanning multiple funds rather than only the newest.
- Q8 ("How many unrealized investments per fund?"): the agent returned
{Fund I: 15, II: 23, III: 17, ...}— total counts per fund, not unrealized. - Q11 ("Total unrealized capital per fund"): the agent returned
{Fund I: 15510, II: 21758, ...}— total EV per fund, not unrealized. This was the exact failure from the demo at the top of this post.
All three were multi-subgoal questions where the second subgoal needed the first subgoal's filter to be applied. The pattern was always: Step 1 said "filter to Unrealized companies" (or "filter to the newest fund"), Step 2 said "now aggregate" or "now list."
The architectural bug was that each subgoal was a separate Gemini call producing a separate code block, executed as a separate sandbox call. The Fact Sheet carried the value from Step 1 forward, but the DataFrame state didn't. Step 2's sandbox call started fresh from the prelude-cleaned df. The LLM, generating Step 2's code, had no live filtered DataFrame to operate on. It only had the Fact Sheet, which described what Step 1 did but not the rows themselves.
So Step 2's code was, predictably, something like df.groupby('Fund')['Entry Enterprise Value'].sum().to_dict() — a perfectly sensible aggregation, applied to the wrong DataFrame.
The architecture sounded clean. CoDaS-style granularity, one subgoal per LLM call, fine-grained Fact Sheet entries for audit. The problem was that the granularity was load-bearing in a way I hadn't thought about: it assumed the LLM could carry state across calls through the Fact Sheet, when really the Fact Sheet was a one-way audit log and the only real state was the DataFrame in the sandbox.
The fix, which I labelled D18 in the project's decision log:
One Gemini call per question, not per subgoal. The QueryPlan still has N subgoals. They appear in the prompt as ordered logical context, but the LLM generates one chained pandas expression that performs all subgoals in sequence and produces the final answer.
The new system prompt has three concrete examples baked in: filter-then-aggregate, lookup-by-fund-name, ratio-then-median. The point of the examples is to give the LLM a template for chaining. If Step 1 is a filter and Step 2 is a groupby-sum, the code becomes:
__result__ = (df[df['Status']=='Unrealized']
.groupby('Fund')['Entry Enterprise Value'].sum()
.to_dict())
One line. One call. One sandbox execution. No state-loss between subgoals because there are no subgoals to span.
The re-run after D18:
- Hard tier: 77.3% → 100% (22/22)
- Cost: $0.27 → $0.19 (fewer LLM calls per question)
- Latency: 232s → 76s per question (3× faster, same reason)
The lesson I took from this is small but real: architectural granularity is not the same thing as architectural quality. Per-subgoal calls sounded like rigorous decomposition, the way you'd structure a system if you were writing it up for a paper. But the granularity was load-bearing on a state-passing mechanism the system didn't actually have. Collapsing the granularity made the system simpler, cheaper, faster, and more accurate at once. Every win came from the same change.
I keep this in mind whenever I see "fine-grained, multi-step, agent-based" framing in a system design now. The question I ask is: what state has to flow between the steps, and does the system actually carry it?
Results, honestly
Post-D18 numbers on the synthetic bench:
| Strategy | Hard tier (synthetic4_A, N=22) | Stratified (N=66) |
|---|---|---|
| Naive RAG (M1.4) | 45.5% | — |
| Full-context Gemini 2.5 Pro (M1.3) | 81.8% | — |
| Multi-agent (M2) | 100% | 97.0% |
The stratified eval covers three structurally different files (column layout, row-separator with 8 funds, row-separator with 9 funds) and runs the same 22-question template against each. The two failures in the stratified run are concentrated on Q8 and Q11 (the two per-fund-dict questions) on the column-layout file specifically. Row-separator layouts pass both at 100%. That's a characterised failure mode — column-layout × per-fund-dict — and one of the things the next milestone (verification) is designed to catch.
Cost works out to about $0.012 per question. Latency is roughly 76 seconds per question on average, with most of that being the codegen-and-execute roundtrip; for an interactive system, this needs to be reduced, and there's more work to do there. For a batch eval against a 528-question bench, it's fine.
The honest caveats:
- N=66 is not 528. The full eval is sequenced behind the verification layer build (I will cover this in Part 2). 97% is the strongest signal I have but it still needs to be verified.
- The bench is synthetic. When I get access to actual FinSheet-Bench, the architecture will encounter question phrasings and workbook structures I haven't seen. The architecture should generalise because the question shapes are the same, but I'm not going to claim the synthetic result transfers until I've measured it.
- "100% on one file" is a tail-risk number. It tells you the architecture is capable of perfect on a hard file, not that it always will be. The honest distribution is broader and the verification work exists to handle the long tail.
Verification: why I'm building it even though the bench says I don't need it
Here's where the post-mortem of my Q11 demo mistake becomes the design rationale for the next milestone.
The 100% on the hard tier was measured against ground truth. The bench verifier has the GT in a JSONL and compares the agent's answer against it. In production when a real analyst loads a real workbook, asking a real question, there is no GT. The Computation Agent could produce numbers that look perfectly reasonable, structurally well-formed, plausibly bounded, and entirely wrong. And nothing in the current architecture would catch it. That is exactly what happened in my demo. The numbers came back, they looked right, I called it a success.
This is the gap between bench accuracy and production trustworthiness, and it is the whole point of the work I'm doing next.
The Verification Agent (M2.4) is a fourth agent that runs after the Computation Agent. It receives the original question, the schema, the plan, the Fact Sheet, and the final answer. Its job is to independently re-derive key facts from the workbook, using a different code formulation than the Computation Agent used, or by directly reading specific cells, or by checking structural invariants, and to cross-check whether the agent's answer is consistent with what it independently observes.
The verdict is one of three:
- Confirmed: independent checks reproduce the answer. Ship it with a "✓ Confirmed" badge.
- Inconsistent: independent checks contradict the answer. Surface the discrepancy. Don't present the agent's number as authoritative.
- Deferred: couldn't verify (question too vague, missing prerequisite data, the Verifier's own code failed). Flag uncertainty rather than forcing a binary correct/incorrect.
The deferral option matters a lot, and it's the same insight that runs through the bias-audit work I've written about elsewhere on this site: an audit (or a verification) that doesn't include "I cannot verify this" as a possible output isn't really an audit. It's just confirmation bias with extra steps. The system has to be able to say "I don't know" or it will always say what you wanted to hear.
For Q11 specifically, the Verifier would have caught my demo failure like this:
Computation Agent answer: {Fund I: 15510.0, Fund II: 21758.5, ...}
Verifier's independent re-derivation:
reads cells where Fund='Fund I' and Status='Unrealized',
sums Entry Enterprise Value → 1352.9
Verifier verdict: INCONSISTENT
Agent's Fund I value differs by 11.5×
from cell-level recomputation.
Do not present this answer as confirmed.
The user would have seen "I'm not confident in this answer" instead of receiving a confident wrong number. That is the difference between a research demo and a production-grade system.
The cost of the Verification Agent is approximately $0.003 per question and 10–20 seconds of additional latency. The benefit is that the system can refuse to lie which is critical for an AI system intended for analysts making real financial decisions.
What I'd skip and what I wouldn't
Some choices that turned out to matter:
Build the bench yourself if you can't get the real one. Not as a substitute but as a complement. The act of building the synthetic dataset forced me to articulate what the question categories actually are, what the failure modes look like, and what the verifier needs to handle.
Run baselines before agents. Always. The number you need to beat is the number a single well-prompted model produces against the same questions. If the agent architecture doesn't clear that bar by a meaningful margin, you don't have an architecture, you have a more expensive way to ask the model.
Pick your sandbox honestly. Docker-per-call with read-only filesystem, no network, locked memory and CPU, and a non-root user is overkill for a research demo and minimum-viable for production. The latency cost is real (a few seconds per call) but the boundary is the safety boundary. Don't fake it with a Python exec and call it sandboxed. I have a LocalSandbox class for fast tests that explicitly requires allow_unsafe=True to construct, because the explicit name reminds future-me that it is not a sandbox.
Some choices that didn't matter as much as I expected:
Custom embedding strategy for RAG. I burned a session trying to improve naive RAG with better chunking. It moved the number by maybe 5 percentage points. The architectural shift to agents moved it by 50. Spend the time where the lift is.
Per-subgoal granularity in code generation. As covered above. Sounded right, but was wrong. The lesson generalises beyond this project.
Some choices I'm glad I made early:
OpenTelemetry tracing from M2.1. Every tool call, every LLM call, every sandbox execution emits a span. When the Q11 bug surfaced, I could trace exactly which subgoal's sandbox call produced which value, and which Fact Sheet entry the next subgoal saw. Without tracing, the bug would have been visible but the cause would have been hours of print-statement debugging. Build the observability before you need it.
Resumable eval runs. The bench runner reads existing results from the JSONL and skips already-completed questions on re-run. This sounds trivial. It is in fact the difference between "the eval is a half-day commitment you do once" and "the eval is a fifteen-minute interactive iteration tool." Resumable evals change how often you run evals, which changes how fast you iterate.
Three things to take away
What's next
The Verification Agent is being built next, then a re-run of the full 528-question benchmark with verification in the loop. After that, a Streamlit dashboard that surfaces verifier verdicts as visible badges (confirmed / inconsistent / deferred), then deployment on Cloud Run. Part 2 of this blog post will cover all of that including whether the Verifier actually catches the kind of plausible-but-wrong answer this post opened with.
The code is on GitHub: github.com/TomaIjatomi/finsheet-agent. It's an active build, not a finished artifact. The README and decision log are the parts I'd suggest reading first if you want to understand the architecture — docs/decisions.md in particular records every choice and the reasoning behind it, including the corrections.
References & further reading
- CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors — Kim et al. (April 2026), Google Research and Google DeepMind. arXiv:2604.14615.
- FinSheet-Bench: From Simple Lookups to Complex Reasoning, Where LLMs Break on Financial Spreadsheets — Ravnik et al. (March 2026), Qubera AG and University of Zurich. arXiv:2603.07316.
- Model Context Protocol — modelcontextprotocol.io
- Google Agent Development Kit (ADK) — google.github.io/adk-docs
- The repo — github.com/TomaIjatomi/finsheet-agent
An invitation
If you're building multi-agent systems on structured data - financial, scientific, operational, or if you're thinking about production-grade safety layers for LLM systems, I'd love to compare notes.
Find me on LinkedIn or through the contact form.