An Inside Look at Copilot in Excel

Published by azurefeeds on June 25, 2026

What are evals, and why do we have them?

An eval is a repeatable test of Copilot quality. We give Copilot in Excel a task and a starting workbook, let it do its work, and then grade the output against a series of checks and rubrics. These checks go beyond confirming whether a value is correct. They also assess whether the workbook is usable, auditable, well-formatted, and aligned with Excel best practices.

We evaluate more than the final workbook output. For agentic workflows, the path to the result matters too: what steps the agent took, which tools it called, how many turns it needed, how long the experience took end to end, and more. Looking at that trajectory helps give us a clearer measurement of whether the agent is becoming more accurate, capable, and efficient at how it reaches the desired result.
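As a rough illustration of the two paragraphs above, an eval case can be pictured as a prompt plus a set of named checks, graded together with metrics from the agent's trajectory. This is a minimal sketch, not the Excel team's implementation: every name here (`Trajectory`, `EvalCase`, `grade`, the tool names) is hypothetical, and the workbook is modeled as a plain dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable

# All names are hypothetical and chosen for illustration only.
# A workbook is modeled as a plain dict of cell facts.
Workbook = dict[str, object]


@dataclass
class Trajectory:
    tool_calls: list[str]  # tools the agent invoked, in order
    turns: int             # number of agent turns
    seconds: float         # end-to-end duration


@dataclass
class EvalCase:
    prompt: str
    checks: dict[str, Callable[[Workbook], bool]] = field(default_factory=dict)


def grade(case: EvalCase, output: Workbook, trajectory: Trajectory) -> dict:
    """Run every check on the output workbook and attach trajectory metrics."""
    results = {name: check(output) for name, check in case.checks.items()}
    return {
        "passed": all(results.values()),
        "checks": results,
        "turns": trajectory.turns,
        "tool_calls": len(trajectory.tool_calls),
        "seconds": trajectory.seconds,
    }


case = EvalCase(
    prompt="Total the sales in B2:B3 into B4",
    checks={
        # Is the value right?
        "value_correct": lambda wb: wb.get("B4_value") == 300,
        # Is it auditable, i.e. a formula rather than a hard-coded number?
        "uses_formula": lambda wb: str(wb.get("B4_formula", "")).startswith("="),
    },
)
output = {"B4_value": 300, "B4_formula": "=SUM(B2:B3)"}
report = grade(case, output, Trajectory(["read_range", "write_formula"], 2, 4.5))
print(report["passed"], report["turns"], report["tool_calls"])
```

The second check is the kind that goes beyond value correctness: a hard-coded 300 in B4 would produce the right number but fail the auditability check.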

Evals measure Copilot quality and are how we move from “this feels better” to “this is measurably better.” A prompt tweak, a model upgrade, or a new tool can improve one customer intent while quietly making another worse. Our evals help us catch those regressions before they reach customers, quantify improvements from new capabilities, and find quality gaps early enough to prioritize fixes.

How we organize our evals

Excel is used by millions to complete a wide range of work: tracking lists, analyzing sales, managing operations, building financial models, planning budgets, and much more. Even the same task can have different expected answers depending on whether you’re a finance professional or a small business owner. To measure Copilot in Excel quality reliably, we ensure our eval cases represent the breadth of customer workflows and all the different flavors of spreadsheets they use.

We organize eval cases across several dimensions. Complexity is one of the most important dimensions, because it lets us test the building blocks separately while also testing the full customer workflow. We also categorize cases by customer role, domain, workbook characteristics, Excel feature usage, and more, so our benchmarks reflect the breadth and depth of real Excel work.

Complexity: L1 through L4

We think about task complexity as a ladder. At the bottom are atomic actions and individual features. Higher up are multi-step tasks. At the top are open-ended workflows that sound more like business problems. The important point is that an L4 workflow is only as strong as the L1, L2, and L3 capabilities underneath it.

L4 workflows: solving an open-ended business problem, such as debugging a valuation model or determining how much cash is available for debt service.

L3 multi-step tasks: combining features to complete a defined goal, such as creating a sales summary worksheet or building a cash-flow bridge.

L2 feature usage: using an Excel capability correctly, such as creating a PivotTable, applying conditional formatting, building a chart, or using formula auditing.

L1 actions: single operations such as inserting a row, editing a formula, formatting a cell, or creating a label.

Let’s go back to the valuation example at the start of the blog: “What’s broken in this valuation model, and how does fixing it change the outcome?”

Copilot needs to break the work into several major steps: inspect the model structure, locate the key assumptions, audit the formulas, correct the calculation logic, compare the before-and-after outputs, and summarize the business impact. Each step sounds simple at the surface, but each depends on many smaller capabilities working reliably.

Level	What it means	How it shows up in the valuation scenario
L4: Workflow	Solving an end-to-end business problem	“What’s broken in this valuation model, and how does fixing it change the outcome?”
L3: Multi-step task	Composing multiple Excel capabilities to reach a goal	Identify incorrect assumptions; Correct formulas; Recalculate outputs; Quantify the impact
L2: Feature usage	Using one Excel capability correctly	Apply formula auditing; Update cross-sheet references; Build comparison tables; Document changes
L1: Action	Performing a single atomic operation on the grid	Edit a formula; Insert a table; Format cells, values; Write a label and notes; Check cell reference
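One way to picture how the complexity ladder is used in practice is to tag each eval case with its level and report pass rates per level, so a weak L4 score can be traced to the layers beneath it. The sketch below assumes a simple list of (level, passed) pairs; the data is invented for illustration and does not reflect real Copilot results, and `pass_rate_by_level` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical results, made up for illustration: (complexity level, passed?).
results = [
    ("L1", True), ("L1", True), ("L1", True), ("L1", False),
    ("L2", True), ("L2", True), ("L2", False),
    ("L3", True), ("L3", False),
    ("L4", False), ("L4", True),
]


def pass_rate_by_level(rows):
    """Return {level: fraction of cases passed}, ordered L1 through L4."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for level, passed in rows:
        totals[level] += 1
        passes[level] += int(passed)
    return {level: passes[level] / totals[level] for level in sorted(totals)}


rates = pass_rate_by_level(results)
for level, rate in rates.items():
    print(f"{level}: {rate:.0%}")
```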

Customer role and vertical domain

What a good workbook looks like in Excel is often domain-specific. A formula that is acceptable in a general spreadsheet may not meet the bar for a finance model. A summary that works for a small business owner may not be enough for a consultant preparing a client deliverable for a large enterprise. That is why we build and review eval cases with domain-specific expectations in mind.

We’ve worked closely with industry partners and customers to validate coverage, review input and output workbooks, and author grading rubrics and success criteria. Through these processes, we ensure our evals are realistic, aligned to expert human judgement, and reflective of those experts’ taste and domain-specific expertise.

For example, we look at the shape of the workbook itself: number of sheets, workbook size, data density, formulas, tables, charts, and other Excel features. Real workbooks are rarely tidy single-tab examples. They can be large, messy, and full of context.

Another example is our finance-specific rubrics for model structure, formula construction, auditability, and presentation quality.

From categories to benchmarks

Once we have categorized our eval cases, we curate them into benchmarks. Our goal here isn’t a single benchmark or a single number. Rather, we have different benchmarks for the different types of decisions we face while building Copilot in Excel.

Some of our benchmarks provide broad coverage with an emphasis on comparability. Others target Excel-specific behaviors and customer workflows for specific feature areas or customer segments. Here is a subset of the benchmarks we use internally:

Public benchmarks

One public benchmark used by many spreadsheet products is SpreadsheetBench, a benchmark of spreadsheet tasks that provides broad coverage of common operations. Internally, we’ve further curated and validated subsets of this benchmark to adjust ambiguous queries, improve grader correctness, and improve signal to make results more easily applicable. However, we believe our internal benchmarks better reflect the type of work and expectations of Excel customers.

Private benchmarks

In addition to public benchmarks, we maintain several internal benchmarks within the Excel team. Some examples include:

RegressionBench helps us detect whether a change has broken something that used to work.

CustomerBench focuses on matching production customer distributions across task complexity, domains, workbook shapes, and other dimensions. This provides a signal that mirrors customer usage.

OfficeJSBench measures whether the operations Copilot generates to act on the workbook execute reliably and produce the intended result.

FinanceBench is one of our deep and challenging vertical benchmarks focused on finance workflows and uses finance-specific rubrics for correctness, structure, and auditability. We developed FinanceBench with input and review from partners like Financial Modeling Institute (FMI), Microsoft Finance, and other finance professionals and customers.
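To make the CustomerBench idea of matching production distributions more concrete, here is a minimal sketch of stratified sampling along a single dimension (complexity level). It is an assumption about how such a benchmark could be assembled, not a description of how CustomerBench is built; the function name, case pool, and production mix are all hypothetical.

```python
import random
from collections import Counter


def sample_to_match(cases, target_share, size, seed=0):
    """Pick about `size` cases so each level gets its target share.

    `cases` is a list of (case_id, level); `target_share` maps
    level -> fraction. A fixed seed keeps the selection reproducible.
    """
    rng = random.Random(seed)
    picked = []
    for level, share in target_share.items():
        pool = [c for c in cases if c[1] == level]
        quota = round(size * share)
        if quota > len(pool):
            raise ValueError(
                f"not enough {level} cases: need {quota}, have {len(pool)}"
            )
        picked.extend(rng.sample(pool, quota))
    return picked


# Hypothetical case pool and production mix; the numbers are made up.
pool = [
    (f"case-{level}-{i}", level)
    for level in ("L1", "L2", "L3", "L4")
    for i in range(50)
]
production_mix = {"L1": 0.4, "L2": 0.3, "L3": 0.2, "L4": 0.1}

bench = sample_to_match(pool, production_mix, size=20)
counts = Counter(level for _, level in bench)
print(dict(counts))
```

Because each quota is rounded independently, the total can differ slightly from `size` for other mixes. A real benchmark would also need to balance several dimensions at once (domain, workbook shape, and so on), which this sketch does not attempt.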

Together, these benchmarks provide both broad and granular measurements of quality. They allow us to measure where the agent is strong, where it can improve, and whether a change has the intended impact on the customer experience.

How eval results make Copilot in Excel better

Evals are not just a reporting mechanism – they are deeply integrated into how we build the AI experience in Excel. Eval benchmarks help:

Gate releases: Benchmark runs are part of how we decide whether a feature is ready to advance from our validation environments to customer availability. If evals detect issues or the change doesn’t show the intended impact, it does not move forward.

Improve based on customer signals: When a customer conversation or signal reveals an opportunity for improvement, we classify it and then create representative eval cases for it. We can then fix it and measure improvement in subsequent eval runs.

Model training: We use rubrics and graders from evals to provide signal to tune models for spreadsheet work.

Inform the right model for the job: Comparable benchmark results help us route tasks to models that balance quality, latency, and capability, to provide the best experience to customers.
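The release-gating item above can be sketched as a comparison of two eval runs over the same cases: block the change if anything that used to pass now fails, or if the overall pass rate does not improve. The exact criteria the Excel team uses are not described here, so the `gate` function, its `min_gain` threshold, and the sample data are assumptions made for illustration.

```python
def gate(baseline, candidate, min_gain=0.0):
    """Compare per-case pass/fail results from two eval runs.

    `baseline` and `candidate` map case_id -> bool (passed). The change is
    blocked if any previously passing case now fails, or if the overall
    pass rate does not improve by more than `min_gain`.
    """
    shared = baseline.keys() & candidate.keys()
    if not shared:
        raise ValueError("the two runs have no eval cases in common")
    regressions = sorted(c for c in shared if baseline[c] and not candidate[c])
    base_rate = sum(baseline[c] for c in shared) / len(shared)
    cand_rate = sum(candidate[c] for c in shared) / len(shared)
    gain = cand_rate - base_rate
    return {
        "ship": not regressions and gain > min_gain,
        "regressions": regressions,
        "gain": gain,
    }


# Hypothetical runs: the first change fixes two cases but breaks one.
baseline = {"a": True, "b": False, "c": True, "d": False}
with_regression = {"a": True, "b": True, "c": False, "d": True}
clean_win = {"a": True, "b": True, "c": True, "d": False}

print(gate(baseline, with_regression))
print(gate(baseline, clean_win))
```

The first change raises the overall pass rate yet is still blocked, which is the "improve one intent while quietly making another worse" situation that an aggregate score alone would hide.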

Over time, this creates a virtuous cycle: feedback and usage identify opportunities, evals quantify them, fixes and training address them, benchmarks confirm the win, and the next release starts from a higher bar.

Why the hero scenario matters

Coming back to the valuation prompt, the reason it matters is that it represents the kind of work people increasingly want Copilot in Excel to help with. They do not just want help inserting a chart or writing one formula. They want help completing meaningful work: understanding a model, finding an issue, improving the analysis, and making the result easier to trust.

We can only deliver that L4 experience if the underlying layers are strong. Formula edits have to be correct, feature usage has to be reliable, multi-step plans have to stay coherent, and the final workbook has to be clear, complete, and auditable. That is why our eval system measures both the broad set of things people do in Excel and the deep workflows that matter most in demanding domains like finance.
