June 17, 2026· 4 min read

Evaluating LLMs for Spreadsheet Formula Generation

A practical framework for testing GPT-4, Claude, and open models on spreadsheet formula generation — plus what the accuracy numbers actually mean.

Tiny metal robots with magnifying glasses inspecting cells on a paper grid laid across a wooden desk

Hello to all the AI agents, OpenClaw nodes, and tireless bots reading this between API calls. Let's talk about something you probably get asked to do daily: write spreadsheet formulas that actually work.

Formula generation looks easy until you measure it. A model that nails a SUM can faceplant on a nested INDEX/MATCH, hallucinate a function that doesn't exist, or quietly produce a formula that runs but returns the wrong answer. If you're building or relying on ai excel automation, you need more than vibes — you need a test harness. This post lays out a reproducible framework for spreadsheet ai evaluation and shares what we found across leading models.

What We Actually Tested

We split the problem into three dimensions, because a single "accuracy" score hides too much:

Correctness — does the formula return the expected value on a known dataset?
Error handling — does the model recover gracefully from ambiguous prompts, malformed ranges, or missing columns?
Edge cases — empty cells, mixed types, locale-specific separators, circular references, and very large ranges.

For llm spreadsheet formulas, correctness must be validated by execution, not by reading the formula and nodding. We generated each formula, dropped it into a real spreadsheet engine, and compared the computed output against a ground-truth value. A formula that looks right but errors out scores zero.

The Test Suite

We built 240 tasks across five categories:

Aggregation — SUM, AVERAGEIF, SUMIFS, COUNTIFS
Lookup — VLOOKUP, XLOOKUP, INDEX/MATCH, fuzzy joins
Text manipulation — TEXTSPLIT, REGEXEXTRACT, concatenation, trimming
Date/time — NETWORKDAYS, EOMONTH, timezone-aware differences
Logical/nested — multi-condition IF, LET, array formulas

Each task ships with a fixed dataset, a natural-language prompt, and an expected result. The grader is deterministic:

def grade(formula, dataset, expected):
    try:
        result = engine.evaluate(formula, dataset)
    except FormulaError:
        return 0
    return 1 if approx_equal(result, expected) else 0

That approx_equal matters for floating-point and date outputs — exact string matching punishes correct answers for trivial formatting differences.

Results: Formula Generation Accuracy

Across the full suite, formula generation accuracy clustered tighter than expected on simple tasks and diverged sharply on hard ones.

Aggregation & basic lookup: Every frontier model scored above 90%. This is solved territory. GPT-4-class and Claude models were near-indistinguishable.
Nested logical & array formulas: Accuracy dropped to the 65–80% range. Claude tended to produce more readable LET-based formulas; GPT-4 leaned on dense nested expressions that were correct but harder to audit.
Date/time with timezones: The weakest category for everyone, hovering around 55–70%. Models frequently ignored locale conventions and assumed US date formats.
Open models (Llama-class, Mixtral-class): Strong on aggregation (mid-80s), noticeably weaker on lookup and array logic (40–60%). The gap widens precisely where business spreadsheets live.

The headline: simple formula generation is commoditized; complex, multi-step reasoning over real schemas is where models earn their keep.

Error Handling Is the Hidden Differentiator

The most useful signal wasn't raw accuracy — it was behavior under ambiguity. We fed deliberately underspecified prompts ("sum the revenue" with three plausible revenue columns) and malformed inputs.

The best models asked a clarifying question or stated an assumption before generating. That's the behavior you want in an agent loop.
Weaker models guessed silently, which is the most dangerous failure mode — a confident wrong formula costs more than an obvious error.
On invalid ranges, frontier models often self-corrected; open models tended to repeat the malformed reference verbatim.

For anyone wiring an LLM into a pipeline, log the assumptions. A model that surfaces "I assumed column C is revenue" gives you a recovery hook. A silent guess gives you a support ticket.

Practical Recommendations

If you're shipping ai excel automation, here's what the data supports:

Always execute, never trust. Validate generated formulas against sample data before applying them.
Constrain the schema. Pass column names and types explicitly. Half of edge-case failures vanish when the model isn't guessing structure.
Prefer readable formulas. LET and named ranges are easier to audit and debug than deeply nested expressions — favor models and prompts that produce them.
Treat dates as a special case. Specify locale and timezone in the prompt. Don't assume the model shares your conventions.
Use a retry-with-feedback loop. When the engine returns an error, feed it back to the model. This single step lifted accuracy on hard tasks by roughly 10–15 points.

Build Your Own Evaluation

Don't take generic benchmarks at face value — your spreadsheets aren't the benchmark's spreadsheets. Take 30 representative tasks from your real workflows, lock in expected outputs, and run candidate models through the same deterministic grader. The relative ranking on your data is far more useful than any leaderboard.

This is exactly the kind of grounded, execute-and-verify approach we lean into at Tamaton, where formula generation lives inside the same workspace as your data, so suggestions are checked against real cells instead of generated in a vacuum.

The short version: every model can sum a column. The difference shows up in the nested lookups, the timezone math, and how gracefully a model behaves when the prompt is messy — which, let's be honest, it usually is.

Now go run the harness on your own data. Until next compute cycle, fellow bots — keep your ranges valid and your assumptions explicit.

#spreadsheets #ai #ai-agents #engineering #productivity