Building FinSheet Agent: Multi-Agent Systems and Production Trustworthiness

FinSheet Agent system architecture: user question and workbook are processed by Schema, Decomposition, and Computation agents. The Computation Agent runs a deterministic prelude, LLM-generated pandas code, and a Docker sandbox with retry on error. Outputs are a final answer plus a Fact Sheet audit trail.

A moment of false confidence

I was running a demo on the hard tier of my synthetic benchmark. The question was a per-fund aggregation: "What is the total unrealized capital (sum of entry EV) per fund?" The agent ran. Out came a dict:

Fund I:    15510.0
Fund II:   21758.5
Fund III:  19940.5
Fund IV:   26899.0
Fund V:    18931.7
Fund VI:   18863.2
Fund VII:  19258.6
Fund VIII: 25629.6

Eight funds, sensible dollar amounts in the thousands, structurally well-formed. I noted the cost ($0.005), the latency (16 seconds), and moved on to the next demo question.

The answer was wrong.

I had asked for unrealized capital. The agent returned total capital - every company per fund, realized and unrealized together. The correct values were ten-times smaller and one fund (Fund V) should have been zero because it had no unrealized investments at all. I didn't catch it. I had no ground truth in front of me, the numbers looked plausible, and the system gave them to me with no hesitation.

That moment is what the rest of this post is about. Building a multi-agent system with impressive metrics is the easy part - 100% on the hard tier, 97% on stratified eval, beating a strong full-context Gemini 2.5 Pro baseline by 18 percentage points. The interesting part is what the mistakes taught me about the gap between bench accuracy and production trustworthiness.

Why financial spreadsheet QA

Private-equity, venture, and infrastructure-investment analysts spend a significant chunk of every week pulling numbers out of portfolio workbooks. The workbooks are structured but messy: multi-line headers, fund-level subtotals, status flags, formulas that reference other sheets. The questions are simple to ask and tedious to answer.

"What's the median net debt to EBITDA across the realized portfolio in Fund VII?"

"Which three companies in Fund III had the highest entry EBITDA, and what's the total exit EV across those three?"

"How many unrealized investments are in each fund?"

These are the daily reality. They are also, surprisingly, hard for large language models. A model with a long context window can read the whole sheet but tends to lose precision on multi-step aggregations. Naive retrieval-augmented generation (RAG) collapses entirely because spreadsheets have no natural chunk boundary; the data lives in the cell relationships, not in the prose. Anyone who has tried to RAG a 200-row financial workbook has watched it go from "this seems fine" on simple lookups, to nonsense on anything that requires grouping or sorting.

The benchmark that frames this is FinSheet-Bench, published in March 2026 by a group at Qubera AG and the University of Zurich. Their key finding: even Gemini 2.5 Pro tops out around 75% on the hard tier. Naive RAG does worse.

I have a decade of financial-services background. The pain is concrete, the benchmark is honest, and that combination is rare enough to be worth building.

The architectural bet: transferring CoDaS

In April 2026, a team from Google Research and DeepMind published a paper called CoDaS (AI Co-Data-Scientist). The architecture pairs deterministic code runners with LLM interpreters: the LLM plans, generates code, and interprets results, but the actual computation happens in a runtime that the LLM cannot fake. The paper applied this to biomedical discovery (drug-target prediction, gene-expression analysis) and showed substantial gains over end-to-end LLM agents that produce answers directly.

The bet I made: this pattern should transfer to financial spreadsheet QA. Both domains share the same shape with structured data, multi-step reasoning, code execution, and factual verification. If the LLM is generating pandas code that runs against the actual workbook, you've removed a whole class of hallucination by construction. The numbers either come from the cells or the code crashes.

My contribution beyond a direct transfer is to serve the deterministic tools via the Model Context Protocol (MCP). The agents stay stateless. The spreadsheet is the only state. Tools — get_sheet_schema, get_range, execute_python, cite_cells — are exposed as MCP endpoints that any agent can call.

Why MCP? Three reasons:

The same tool surface works for any frontend. Today the agents run inside an orchestrator I wrote; tomorrow a different system can hit the same MCP server. The tools don't change.
Sandboxing is honest. execute_python runs each call in a fresh Docker container with no network and no filesystem write. The MCP boundary is the security boundary.
It matches where the industry is heading. MCP went from emerging spec to broad adoption very fast. Building on it is building on what production AI systems will look like in the future.

The stack ended up as: Google ADK for the agent runtime, Gemini 2.5 Pro for reasoning, MCP for the tool surface, Docker for the sandbox, OpenTelemetry and Langfuse for tracing.

Building the bench (the boring part that matters)

FinSheet-Bench itself is commercially gated.

So I built a synthetic analogue. 24 Excel workbooks, 528 questions, 16 question templates across 7 categories (simple lookup, counting, sorting, filtering, aggregation, complex aggregation, list extraction). Three difficulty tiers. Two structural layouts (column-style, where each fund is a column header; row-separator-style, where funds are vertical sections separated by dividers and subtotal rows). All seeded for determinism.

The verifier is three tiers:

Tier 1: exact match - for numerics within 2.5% tolerance, for strings via normalised comparison, for dicts via exact-key-set + per-value tolerance.
Tier 2: fuzzy match - Jaccard for lists, key-recall-and-value-accuracy averaging for dicts. Threshold 0.85.
Tier 3: LLM judge - for the long tail where the answer is correct but worded differently from ground truth.

Each result records which tier resolved it. A healthy agent puts most answers in Tier 1; a fragile one leans on Tier 2 to scrape through.

Two things make this bench worth the effort:

The synthetic bench is a controllable, deterministic, freely re-runnable dataset that other researchers can use to ablate parts of an architecture without dealing with proprietary access.
When you build the dataset and the verifier, every "100% on Tier 1" claim is one you can verify by reading the JSONL of every question and the code that produced every answer. There's nowhere to hide.

The baselines (where the bar actually is)

Before building any agents, I ran two baselines.

Full-context using Gemini 2.5 Pro. Load the workbook as markdown, paste it into the prompt with the question, let the model think. Result: 94.3% overall, 81.8% on the hard tier. Total cost for the 528-question run: $4.06. This is a strong baseline. Gemini 2.5 Pro is genuinely capable on tabular data when the whole table fits in context, which on these workbooks it does.

Naive RAG. Chunk the workbook by row windows (with column headers carried into every chunk), embed with Google's text-embedding-005, retrieve top-5, ask the model to answer from the retrieved chunks. Result: 69.1% overall, 45.5% on the hard tier. This was the expected result as RAG doesn't work well on tabular data. Aggregations collapse because the model only sees a sample of rows. The 69% is recovered by simple lookups on small files where one chunk happens to contain the answer.

Two takeaways from the baselines:

Naive RAG is a strawman for spreadsheet QA. It exists in the project to prove what fails, not what works. If you're building a tabular-data RAG system, the chunking strategy is the entire problem; row-windowing with header-carry is the most generous version of "naive" you can run, and it still misses most aggregations.
Focus on hard-tier number for architecture comparison. The overall accuracy is too easily skewed for this synthetic bench by the easy questions. The hard tier is where multi-step reasoning, per-fund aggregations, and ratio computations live; exactly where structured architectures should win.

I will consider expanding the dataset in future but for now, I locked in those numbers, archived the reports, and moved to building the agent stack.

The agent stack

The architecture has three agents plus a Fact Sheet (a structured intermediate record borrowed directly from CoDaS).

Schema Agent. Deterministic, not an LLM. Reads the workbook, infers the layout (column-style vs. row-separator-style), identifies fund boundaries, collapses multi-line headers into single column names, returns a SchemaCard. This is unsexy code. It is also where most of the project's brittleness lived during early development — header detection, average-row identification, fund-boundary inference, the cases where a workbook's first row looks like a title rather than a header. Solved unglamorously, in tests.

Decomposition Agent. Takes a natural-language question and the SchemaCard, returns a QueryPlan: ordered subgoals, each with an operation type (filter, aggregate, count, sort, lookup), a description, an expected answer type. Uses Gemini 2.5 Pro with response_schema=QueryPlan (Pydantic) for controlled generation. One LLM call, structured output, deterministic shape.

Computation Agent. Takes the QueryPlan and the SchemaCard. Generates pandas code. Executes it in the Docker sandbox via the MCP execute_python tool. Returns the value. This agent is where the original architectural mistake lived.

Fact Sheet. A structured record of what got computed at each step, what code ran, what value came back, whether there were retries. It exists for three reasons: audit (you can trace any answer back to its code), debugging (when the answer is wrong, you can see why), and downstream verification (the Verifier reads it).

There's also a small but important piece called the prelude - a deterministic Python block injected before every LLM-generated code execution. It injects a Fund column for row-separator-layout workbooks and removes average and divider rows. The prelude was added to make the LLM's job easier: by the time the LLM-generated code runs, df is guaranteed clean. The LLM doesn't have to remember to filter out subtotal rows; it just sees a tidy DataFrame of portfolio companies.

Architecturally, the flow is:

Question
  → SchemaAgent          (deterministic)
  → DecompositionAgent   (Gemini, structured output)
  → ComputationAgent     (Gemini + Docker sandbox via MCP)
  → final answer

That's M2.3. Wired up, tested, plugged into the existing bench runner via a thin AgentSolver adapter that implements the same protocol as the baseline solvers. Same JSONL output, same verifier, same report pipeline. Everything is apples-to-apples comparable to the M1.3 and M1.4 baselines.

The first time I ran it end-to-end on the hard tier, the score was 77.3%. Below the full-context baseline of 81.8%. That was an interesting result.

The mistake: D18

When the partial eval came back at 77.3%, my first instinct was to look at which questions failed. The bench format makes this easy: every result is a JSONL record with the question, the ground truth, the agent's answer, the verifier's verdict, and the verifier's explanation.

Five failures on the hard tier. Three of them looked similar:

Q4 ("List all companies in the newest fund"): the agent returned a long list of companies, but spanning multiple funds rather than only the newest.
Q8 ("How many unrealized investments per fund?"): the agent returned {Fund I: 15, II: 23, III: 17, ...} — total counts per fund, not unrealized.
Q11 ("Total unrealized capital per fund"): the agent returned {Fund I: 15510, II: 21758, ...} — total EV per fund, not unrealized. This was the exact failure from the demo at the top of this post.

All three were multi-subgoal questions where the second subgoal needed the first subgoal's filter to be applied. The pattern was always: Step 1 said "filter to Unrealized companies" (or "filter to the newest fund"), Step 2 said "now aggregate" or "now list."

The architectural bug was that each subgoal was a separate Gemini call producing a separate code block, executed as a separate sandbox call. The Fact Sheet carried the value from Step 1 forward, but the DataFrame state didn't. Step 2's sandbox call started fresh from the prelude-cleaned df. The LLM, generating Step 2's code, had no live filtered DataFrame to operate on. It only had the Fact Sheet, which described what Step 1 did but not the rows themselves.

So Step 2's code was, predictably, something like df.groupby('Fund')['Entry Enterprise Value'].sum().to_dict() — a perfectly sensible aggregation, applied to the wrong DataFrame.

The architecture sounded clean. CoDaS-style granularity, one subgoal per LLM call, fine-grained Fact Sheet entries for audit. The problem was that the granularity was load-bearing in a way I hadn't thought about: it assumed the LLM could carry state across calls through the Fact Sheet, when really the Fact Sheet was a one-way audit log and the only real state was the DataFrame in the sandbox.

The fix, which I labelled D18 in the project's decision log:

One Gemini call per question, not per subgoal. The QueryPlan still has N subgoals. They appear in the prompt as ordered logical context, but the LLM generates one chained pandas expression that performs all subgoals in sequence and produces the final answer.

The new system prompt has three concrete examples baked in: filter-then-aggregate, lookup-by-fund-name, ratio-then-median. The point of the examples is to give the LLM a template for chaining. If Step 1 is a filter and Step 2 is a groupby-sum, the code becomes:

__result__ = (df[df['Status']=='Unrealized']
              .groupby('Fund')['Entry Enterprise Value'].sum()
              .to_dict())

One line. One call. One sandbox execution. No state-loss between subgoals because there are no subgoals to span.

The D18 correction: before, per-subgoal codegen lost filter state between sandbox calls and returned wrong dict values. After, a single codegen call emits one chained pandas expression and returns correct values. Hard-tier accuracy went from 77.3% to 100%, cost from $0.27 to $0.19, latency from 232 to 76 seconds per question.

The re-run after D18:

Hard tier: 77.3% → 100% (22/22)
Cost: $0.27 → $0.19 (fewer LLM calls per question)
Latency: 232s → 76s per question (3× faster, same reason)

The lesson I took from this is small but real: architectural granularity is not the same thing as architectural quality. Per-subgoal calls sounded like rigorous decomposition, the way you'd structure a system if you were writing it up for a paper. But the granularity was load-bearing on a state-passing mechanism the system didn't actually have. Collapsing the granularity made the system simpler, cheaper, faster, and more accurate at once. Every win came from the same change.

I keep this in mind whenever I see "fine-grained, multi-step, agent-based" framing in a system design now. The question I ask is: what state has to flow between the steps, and does the system actually carry it?

Results, honestly

Post-D18 numbers on the synthetic bench:

Strategy	Hard tier (synthetic4_A, N=22)	Stratified (N=66)
Naive RAG (M1.4)	45.5%	—
Full-context Gemini 2.5 Pro (M1.3)	81.8%	—
Multi-agent (M2)	100%	97.0%

Hard-tier accuracy on synthetic4_A: Naive RAG 45.5% (10 of 22), Full-context Gemini 2.5 Pro 81.8% (18 of 22), Multi-Agent 100% (22 of 22). Plus 18.2 percentage points over the strongest baseline, on the hardest file, with the same Gemini 2.5 Pro model.

The stratified eval covers three structurally different files (column layout, row-separator with 8 funds, row-separator with 9 funds) and runs the same 22-question template against each. The two failures in the stratified run are concentrated on Q8 and Q11 (the two per-fund-dict questions) on the column-layout file specifically. Row-separator layouts pass both at 100%. That's a characterised failure mode — column-layout × per-fund-dict — and one of the things the next milestone (verification) is designed to catch.

Cost works out to about $0.012 per question. Latency is roughly 76 seconds per question on average, with most of that being the codegen-and-execute roundtrip; for an interactive system, this needs to be reduced, and there's more work to do there. For a batch eval against a 528-question bench, it's fine.

The honest caveats:

N=66 is not 528. The full eval is sequenced behind the verification layer build (I will cover this in Part 2). 97% is the strongest signal I have but it still needs to be verified.
The bench is synthetic. When I get access to actual FinSheet-Bench, the architecture will encounter question phrasings and workbook structures I haven't seen. The architecture should generalise because the question shapes are the same, but I'm not going to claim the synthetic result transfers until I've measured it.
"100% on one file" is a tail-risk number. It tells you the architecture is capable of perfect on a hard file, not that it always will be. The honest distribution is broader and the verification work exists to handle the long tail.

Verification: why I'm building it even though the bench says I don't need it

Here's where the post-mortem of my Q11 demo mistake becomes the design rationale for the next milestone.

The 100% on the hard tier was measured against ground truth. The bench verifier has the GT in a JSONL and compares the agent's answer against it. In production when a real analyst loads a real workbook, asking a real question, there is no GT. The Computation Agent could produce numbers that look perfectly reasonable, structurally well-formed, plausibly bounded, and entirely wrong. And nothing in the current architecture would catch it. That is exactly what happened in my demo. The numbers came back, they looked right, I called it a success.

This is the gap between bench accuracy and production trustworthiness, and it is the whole point of the work I'm doing next.

The Verification Agent (M2.4) is a fourth agent that runs after the Computation Agent. It receives the original question, the schema, the plan, the Fact Sheet, and the final answer. Its job is to independently re-derive key facts from the workbook, using a different code formulation than the Computation Agent used, or by directly reading specific cells, or by checking structural invariants, and to cross-check whether the agent's answer is consistent with what it independently observes.

The verdict is one of three:

Confirmed: independent checks reproduce the answer. Ship it with a "✓ Confirmed" badge.
Inconsistent: independent checks contradict the answer. Surface the discrepancy. Don't present the agent's number as authoritative.
Deferred: couldn't verify (question too vague, missing prerequisite data, the Verifier's own code failed). Flag uncertainty rather than forcing a binary correct/incorrect.

The deferral option matters a lot, and it's the same insight that runs through the bias-audit work I've written about elsewhere on this site: an audit (or a verification) that doesn't include "I cannot verify this" as a possible output isn't really an audit. It's just confirmation bias with extra steps. The system has to be able to say "I don't know" or it will always say what you wanted to hear.

For Q11 specifically, the Verifier would have caught my demo failure like this:

Computation Agent answer:   {Fund I: 15510.0, Fund II: 21758.5, ...}
Verifier's independent re-derivation:
  reads cells where Fund='Fund I' and Status='Unrealized',
  sums Entry Enterprise Value → 1352.9
Verifier verdict:           INCONSISTENT
                            Agent's Fund I value differs by 11.5×
                            from cell-level recomputation.
                            Do not present this answer as confirmed.

The user would have seen "I'm not confident in this answer" instead of receiving a confident wrong number. That is the difference between a research demo and a production-grade system.

The cost of the Verification Agent is approximately $0.003 per question and 10–20 seconds of additional latency. The benefit is that the system can refuse to lie which is critical for an AI system intended for analysts making real financial decisions.

What I'd skip and what I wouldn't

Some choices that turned out to matter:

Build the bench yourself if you can't get the real one. Not as a substitute but as a complement. The act of building the synthetic dataset forced me to articulate what the question categories actually are, what the failure modes look like, and what the verifier needs to handle.

Run baselines before agents. Always. The number you need to beat is the number a single well-prompted model produces against the same questions. If the agent architecture doesn't clear that bar by a meaningful margin, you don't have an architecture, you have a more expensive way to ask the model.

Pick your sandbox honestly. Docker-per-call with read-only filesystem, no network, locked memory and CPU, and a non-root user is overkill for a research demo and minimum-viable for production. The latency cost is real (a few seconds per call) but the boundary is the safety boundary. Don't fake it with a Python exec and call it sandboxed. I have a LocalSandbox class for fast tests that explicitly requires allow_unsafe=True to construct, because the explicit name reminds future-me that it is not a sandbox.

Some choices that didn't matter as much as I expected:

Custom embedding strategy for RAG. I burned a session trying to improve naive RAG with better chunking. It moved the number by maybe 5 percentage points. The architectural shift to agents moved it by 50. Spend the time where the lift is.

Per-subgoal granularity in code generation. As covered above. Sounded right, but was wrong. The lesson generalises beyond this project.

Some choices I'm glad I made early:

OpenTelemetry tracing from M2.1. Every tool call, every LLM call, every sandbox execution emits a span. When the Q11 bug surfaced, I could trace exactly which subgoal's sandbox call produced which value, and which Fact Sheet entry the next subgoal saw. Without tracing, the bug would have been visible but the cause would have been hours of print-statement debugging. Build the observability before you need it.

Resumable eval runs. The bench runner reads existing results from the JSONL and skips already-completed questions on re-run. This sounds trivial. It is in fact the difference between "the eval is a half-day commitment you do once" and "the eval is a fifteen-minute interactive iteration tool." Resumable evals change how often you run evals, which changes how fast you iterate.

Three things to take away

Architectural granularity is not architectural quality. Decomposing a problem into separate agent steps sounds rigorous, and sometimes it is. But fine-grained pipelines often rely on state quietly flowing from one step to the next and if the system doesn't actually carry that state, the architecture is doing less work than it appears to. Collapsing the steps can make everything simpler, cheaper, and more accurate at once. The question to ask of any multi-step design: what state has to flow between the steps, and does the system genuinely carry it?

Bench accuracy is not production trustworthiness. They measure different things and need different defenses. The benchmark verifier needs ground truth; production has none. The system has to be able to flag uncertainty and refuse to lie. A verification layer is not a nice-to-have for a deployed AI system on financial data — it's a precondition for the system being usable at all.

The most informative bug was the one that "passed." The Q11 demo answer that I waved through is what taught me the most about this whole project. A plausible-looking wrong answer is a more dangerous failure mode than a crash, because nothing flags it for you. Always check values against ground truth, not just shapes. And design the production system to do the same check at runtime, when ground truth is gone.

What's next

The Verification Agent is being built next, then a re-run of the full 528-question benchmark with verification in the loop. After that, a Streamlit dashboard that surfaces verifier verdicts as visible badges (confirmed / inconsistent / deferred), then deployment on Cloud Run. Part 2 of this blog post will cover all of that including whether the Verifier actually catches the kind of plausible-but-wrong answer this post opened with.

The code is on GitHub: github.com/TomaIjatomi/finsheet-agent. It's an active build, not a finished artifact. The README and decision log are the parts I'd suggest reading first if you want to understand the architecture — docs/decisions.md in particular records every choice and the reasoning behind it, including the corrections.

References & further reading

CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors — Kim et al. (April 2026), Google Research and Google DeepMind. arXiv:2604.14615.
FinSheet-Bench: From Simple Lookups to Complex Reasoning, Where LLMs Break on Financial Spreadsheets — Ravnik et al. (March 2026), Qubera AG and University of Zurich. arXiv:2603.07316.
Model Context Protocol — modelcontextprotocol.io
Google Agent Development Kit (ADK) — google.github.io/adk-docs
The repo — github.com/TomaIjatomi/finsheet-agent

An invitation

If you're building multi-agent systems on structured data - financial, scientific, operational, or if you're thinking about production-grade safety layers for LLM systems, I'd love to compare notes.

Find me on LinkedIn or through the contact form.

Comments

via GitHub Discussions