Altay AI

Evals are the new internal control

Why evaluation harnesses belong in your control framework — and how to design them so finance can sign off.

Altay AI · 8 April 2026

If you've ever signed a SOX narrative, you already understand evals. You just haven't called them that yet.

An internal control is a repeatable check that a process is producing the outcome it claims to produce. An eval is a repeatable check that an AI agent is producing the outcome it claims to produce. Same shape. Different vocabulary.

The problem is that evals usually live in the engineering team's notebooks, not in the controls matrix. So when the model changes, the prompt changes, or the vendor pushes an update, finance finds out after the fact. That's not a control environment. That's a hope environment.

What a finance-grade eval looks like

Four properties, all unglamorous:

  1. Owned by finance. The person who signs the rep letter signs off on the eval set.
  2. Versioned. Test cases live in a repo with a change history, not on someone's laptop.
  3. Run on every change. Model upgrade, prompt edit, vendor patch — all gated by the same harness.
  4. Scored deterministically. "Looks good" is not a score. Define what passing means in advance, so the same output always gets the same result.
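
To make the four properties concrete, here is a minimal Python sketch, not a reference implementation. The names EvalCase, EvalSet and gate are hypothetical, the agent is assumed to be any callable that takes a prompt and returns text, and exact string match stands in for whatever scoring rule finance defines in advance.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str


@dataclass
class EvalSet:
    owner: str  # hypothetical field: the finance owner who signs off on the set
    cases: list[EvalCase] = field(default_factory=list)

    def fingerprint(self) -> str:
        """Short content hash, so any edit to the test cases changes the version."""
        payload = json.dumps([[c.case_id, c.prompt, c.expected] for c in self.cases])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def gate(eval_set: EvalSet, agent, change: str, required_rate: float = 1.0) -> dict:
    """Run every case for one change and return a pass/fail record.

    Exact match is a placeholder scorer (an assumption, not the article's rule).
    """
    if not eval_set.cases:
        raise ValueError("eval set is empty")
    passed = sum(agent(c.prompt).strip() == c.expected for c in eval_set.cases)
    rate = passed / len(eval_set.cases)
    return {
        "change": change,
        "owner": eval_set.owner,
        "eval_set_version": eval_set.fingerprint(),
        "pass_rate": rate,
        "approved": rate >= required_rate,
    }
```

Calling gate(...) for a model upgrade, a prompt edit and a vendor patch sends all three through the same harness, and each record ties the result to a named owner and a specific version of the test cases.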

A worked example: flux commentary

Say you've deployed an agent to draft variance commentary for the management pack. The eval set might be:

  • 30 historical variances with known, signed-off commentary
  • For each, the agent's draft is scored against the human version on three axes: factual accuracy, materiality framing, and tone
  • A pass requires ≥90% on accuracy and ≥85% on each of the other two
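
The pass rule above can be sketched in a few lines of Python. This assumes each case has already been scored from 0.0 to 1.0 on each axis, and that the thresholds apply to the average across the eval set; the text does not say whether they apply per case or in aggregate, so treat that as an assumption. The axis keys and the function name score_run are illustrative.

```python
from statistics import mean

# Thresholds from the worked example; the axis keys are illustrative labels.
THRESHOLDS = {"accuracy": 0.90, "materiality": 0.85, "tone": 0.85}


def score_run(case_scores: list[dict[str, float]]) -> dict:
    """Average each axis across all cases and compare it with its threshold."""
    if not case_scores:
        raise ValueError("eval set is empty")
    averages = {axis: mean(s[axis] for s in case_scores) for axis in THRESHOLDS}
    # Any axis whose average falls below its threshold fails the run.
    failures = sorted(a for a, v in averages.items() if v < THRESHOLDS[a])
    return {"averages": averages, "failures": failures, "passed": not failures}
```

Because the thresholds are fixed before the run, the same set of scores always produces the same verdict, and the failures list names the axis that caused a fail.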

Run it monthly. Run it after every prompt change. Run it after the vendor's release notes mention "improved reasoning."

When the scores drop, you have a control finding. When they hold, you have evidence for the auditor.
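
One way to turn each run into something an auditor can read is an append-only log. The sketch below assumes the harness has already produced a pass/fail result and per-axis averages; record_run, the field names and the JSON Lines file are hypothetical choices, not a prescribed format.

```python
import json
from datetime import datetime, timezone


def record_run(trigger: str, passed: bool, averages: dict, log_path: str) -> dict:
    """Append one eval run to a JSON Lines log and return the entry."""
    entry = {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "trigger": trigger,  # e.g. "monthly", "prompt change", "vendor release"
        "averages": averages,
        # A failed run is logged as a finding, a passing run as evidence.
        "status": "evidence" if passed else "control finding",
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

Opening the file in append mode keeps earlier runs intact, so the log doubles as the change history for the control.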

Why this matters now

Auditors are asking. Boards are asking. Regulators are starting to ask. The teams that already have eval harnesses in their control framework will answer those questions with a dashboard. The teams that don't will answer them with a project plan.

We know which one we'd rather present.