Playbook for Trustworthy Third‑Party Evaluations of Frontier Models

OpenAI

Document the claim and harness details in evaluation reports to ensure validity and reproducibility.

What to do now

Create a standard evaluation report template that requires claim, harness, evidence, and validity checks, and enforce its use for all future evaluations.
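
A minimal sketch of how such a template could be enforced in code, assuming Python. The class names, field names, and check identifiers below are hypothetical illustrations, not taken from the playbook. The sketch treats the four required parts (claim, harness, evidence, validity checks) as fields and reports which ones are still empty.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class ClaimBucket(Enum):
    # The three claim buckets named in the summary.
    CAPABILITY_ELICITATION = "capability elicitation"
    SAFEGUARD_PERFORMANCE = "safeguard performance"
    CONTROLLED_COMPARISON = "controlled comparison"


# Validity checks listed in the summary; the identifiers are hypothetical.
REQUIRED_VALIDITY_CHECKS = (
    "reward_hacking",
    "refusals",
    "contamination",
    "broken_problems",
    "sandbagging",
)


@dataclass
class Harness:
    environment: str = ""
    tools: List[str] = field(default_factory=list)
    token_budget: Optional[int] = None  # assumed unit: tokens per attempt
    scoring_method: str = ""


@dataclass
class EvaluationReport:
    claim: str = ""
    bucket: Optional[ClaimBucket] = None
    harness: Harness = field(default_factory=Harness)
    evidence: List[str] = field(default_factory=list)
    # Maps each check name to a short note on how it was examined.
    validity_checks: Dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> List[str]:
        """Return the names of required fields that are still empty."""
        missing = []
        if not self.claim.strip():
            missing.append("claim")
        if self.bucket is None:
            missing.append("bucket")
        if not self.harness.environment.strip():
            missing.append("harness.environment")
        if self.harness.token_budget is None:
            missing.append("harness.token_budget")
        if not self.harness.scoring_method.strip():
            missing.append("harness.scoring_method")
        if not self.evidence:
            missing.append("evidence")
        for check in REQUIRED_VALIDITY_CHECKS:
            if not self.validity_checks.get(check, "").strip():
                missing.append(f"validity_checks.{check}")
        return missing
```

A review step could then reject any report for which missing_fields() returns a non-empty list, which is one way to enforce use of the template rather than merely recommend it.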

Summary

May 29, 2026 – A new playbook for trustworthy third‑party evaluations of frontier models has been released. It outlines how to design harnesses that reflect real‑world tool use, state tracking, and multi‑step workflows. The authors argue that the harness (the environment, tools, and budget) can change a model’s measured performance, so evaluation reports must explicitly state the claim being tested and the evidence that validates the result. They identify three claim buckets: capability elicitation, safeguard performance, and controlled comparison. They also list validity checks covering reward hacking, refusals, contamination, broken problems, and sandbagging. The playbook cites GPT‑5.5’s improved performance when a compaction‑enabled harness preserves context, and UK AISI’s cyber‑range study, in which raising the token budget from 10M to 100M tokens boosted success rates by up to 59%. It also highlights METR’s shared‑task framework, which uses reusable scaffolds like Triframe and ReAct to produce comparable estimates across systems. The authors recommend that evaluation reports include a detailed description of the harness, the budget, and the scoring method so that readers can assess the robustness of the claim. They caution that standardized harnesses may under‑elicit capability, so bespoke harnesses are preferable when feasible.
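
To illustrate why the budget belongs in the report, here is a small Python sketch that computes a success rate from the same set of runs under two token budgets. The function name and the numbers are made up for illustration; they are not data from the UK AISI study or any other source.

```python
from typing import List, Optional


def success_rate_at_budget(tokens_to_solve: List[Optional[int]], budget: int) -> float:
    """Fraction of tasks solved within `budget` tokens.

    Each entry is the number of tokens the agent had used when it solved
    the task, or None if it never solved it at any budget tried.
    """
    if not tokens_to_solve:
        raise ValueError("need at least one task")
    solved = sum(1 for t in tokens_to_solve if t is not None and t <= budget)
    return solved / len(tokens_to_solve)


# Made-up numbers for illustration only; not data from any study.
toy_runs = [4_000_000, 12_000_000, 60_000_000, None, 95_000_000]

for budget in (10_000_000, 100_000_000):
    rate = success_rate_at_budget(toy_runs, budget)
    print(f"budget={budget:>11,} tokens  success rate={rate:.0%}")
```

With these toy values, one of the five tasks is solved within the smaller budget and four within the larger one, so the same model and tasks yield very different headline numbers depending on a single harness parameter.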

Key changes

Harness choice can alter measured capability, e.g., GPT‑5.5 performs better with compaction‑enabled context preservation
UK AISI cyber‑range study shows raising the token budget from 10M to 100M tokens boosts success by up to 59%
Evaluation reports must state the claim tested and evidence validating the result
Three claim buckets: capability elicitation, safeguard performance, controlled comparison
Validity checks include reward hacking, refusals, contamination, broken problems, sandbagging
Standardized harnesses may under‑elicit capability; bespoke harnesses recommended
METR framework uses shared tasks and reusable scaffolds like Triframe and ReAct

Affects

internal
