Briefing

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks

ai-dev
Claude OpenAI


What to do now

Run: Evaluate your own agent on the ITBench‑AA SRE tasks with the Stirrup harness to benchmark its performance.
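The briefing reports that the benchmark runs each task 3 times under a 100‑turn cap. A minimal sketch of that loop is below. It does not use the Stirrup API: the `Agent` callable, the `evaluate` function, and the rule that an over‑budget run scores zero are all assumptions made for illustration.

```python
from statistics import mean
from typing import Callable, Iterable

MAX_TURNS = 100  # turn cap reported for ITBench-AA
REPEATS = 3      # repeats per task reported for ITBench-AA

# Hypothetical agent interface (NOT the Stirrup API): takes a task id and a
# turn budget, returns (score between 0 and 1, number of turns used).
Agent = Callable[[str, int], tuple[float, int]]


def evaluate(agent: Agent, task_ids: Iterable[str]) -> dict[str, float]:
    """Average each task over REPEATS runs, then average across tasks."""
    per_task: dict[str, float] = {}
    turns: list[int] = []
    for task_id in task_ids:
        scores = []
        for _ in range(REPEATS):
            score, used = agent(task_id, MAX_TURNS)
            if used > MAX_TURNS:
                score = 0.0  # assumption: exceeding the cap counts as a failure
            scores.append(score)
            turns.append(min(used, MAX_TURNS))
        per_task[task_id] = mean(scores)
    return {"accuracy": mean(per_task.values()), "avg_turns": mean(turns)}
```

Averaging per task before averaging across tasks keeps every task equally weighted; whether the leaderboard aggregates the same way is not stated in this briefing.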

Summary

ITBench‑AA, launched on May 27, 2026 by Artificial Analysis and IBM, evaluates frontier models on 59 Kubernetes incident‑response tasks: 40 public and 19 new held‑out cases. Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads the leaderboard at 47 %, followed by GPT‑5.5 (xhigh) at 46 % and Qwen3.7 Max at 42 %. No model reaches 50 %, which points to a gap in agentic IT capabilities.

Turn‑count analysis shows GPT‑5.5 averaging 31 turns per task and Gemini 3.1 Pro Preview averaging 83, yet longer trajectories do not translate into higher accuracy, because agents are penalised for false positives when they report extra root‑cause entities. On cost, Gemma 4 31B reaches 37 % at $0.14 per task, far cheaper than Gemini 3.1 Pro Preview ($2.23 per task) and GLM‑5.1 ($1.23 per task, 40 %).

The benchmark runs on the open‑source Stirrup harness with a 100‑turn cap and 3 repeats per task, so every model is evaluated under the same conditions. Each task supplies alerts, events, traces, metrics, logs, and topology snapshots, and the agent must identify the minimal set of independent root‑cause Kubernetes entities. The results suggest that current frontier models still struggle with complex SRE scenarios, especially when over‑investigation is penalised. Future iterations of ITBench‑AA will expand to FinOps and CISO tasks, giving a broader view of enterprise IT agent performance.
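To see why reporting extra root‑cause entities can lower a score, the sketch below uses a set‑based F1 over reported entities. This is an assumed stand‑in: the briefing does not give the exact ITBench‑AA scoring formula, and the entity names are hypothetical.

```python
def score_root_causes(predicted: set[str], truth: set[str]) -> dict[str, float]:
    """Set-based precision, recall, and F1 over reported entities.

    Assumed metric for illustration only, not the official ITBench-AA scorer.
    """
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denominator = len(predicted) + len(truth)
    f1 = 2 * true_positives / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical incident with a single true root cause.
truth = {"deployment/checkout"}

# An agent that names only the real root cause.
focused = score_root_causes({"deployment/checkout"}, truth)

# An agent that over-investigates and pads its answer with two extra entities.
padded = score_root_causes(
    {"deployment/checkout", "pod/cart-7f9", "service/payment"}, truth
)

print(focused["f1"], padded["f1"])
```

Both agents find the true root cause, so recall is identical; the padded answer loses only on precision, which is the kind of false‑positive penalty the summary describes.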

Key changes

  • 59 SRE tasks (40 public, 19 new)
  • Claude Opus 4.7 leads at 47 %
  • GPT‑5.5 at 46 %, Qwen3.7 Max at 42 %
  • Turn‑count: GPT‑5.5 avg 31 turns, Gemini 3.1 Pro Preview avg 83 turns
  • Gemma 4 31B scores 37 % at $0.14/task
  • GLM‑5.1 scores 40 % at $1.23/task
  • Benchmark uses Stirrup harness, 100‑turn cap, 3 repeats
  • Agents penalised for extra root‑cause entities
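
The cost figures above can be turned into a rough efficiency comparison. The sketch below computes score points per dollar and the cost of one pass over all 59 tasks. It assumes the quoted per‑task cost covers a single run of a task, which the briefing does not confirm; Gemini 3.1 Pro Preview has no score listed next to its cost, so its ratio is left undefined.

```python
from dataclasses import dataclass
from typing import Optional

TASKS = 59  # total SRE tasks in ITBench-AA


@dataclass(frozen=True)
class Entry:
    model: str
    cost_per_task_usd: float
    score_pct: Optional[float]  # None where the briefing gives no score


ENTRIES = [
    Entry("Gemma 4 31B", 0.14, 37.0),
    Entry("GLM-5.1", 1.23, 40.0),
    Entry("Gemini 3.1 Pro Preview", 2.23, None),
]


def points_per_dollar(entry: Entry) -> Optional[float]:
    """Score percentage points bought per dollar of per-task cost."""
    if entry.score_pct is None:
        return None
    return entry.score_pct / entry.cost_per_task_usd


def pass_cost_usd(entry: Entry) -> float:
    """Cost of one pass over every task (assumes cost is per single run)."""
    return entry.cost_per_task_usd * TASKS


for entry in ENTRIES:
    ratio = points_per_dollar(entry)
    ratio_text = "n/a" if ratio is None else f"{ratio:.2f}"
    print(f"{entry.model}: {ratio_text} pts/$, ${pass_cost_usd(entry):.2f} per pass")
```

On these numbers GLM‑5.1 scores 3 points higher than Gemma 4 31B but costs roughly nine times as much per task, so the cheaper model comes out well ahead on points per dollar.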

Affects

enterprise
