AI Models Struggle to Fix Real-World CVEs – Benchmark Results Updated

by logickkk1 · OpenAI CVE-2026-26331 CVE-2026-33175 CVE-2026-40864 CVE-2026-42561 CVE-2026-44431

Test your models against CVE‑Bench: the best model, GPT‑5.5, solves 50 % of tasks overall and 60 % when given the full advisory.

What to do now

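Run CVE‑Bench on your own models to benchmark vulnerability repair, then compare solve rates across the advisory, diagnose and locate conditions and adjust your prompt strategy where the diagnose and locate results fall short.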
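A per-condition comparison like the one suggested above could be tallied as follows. This is a minimal sketch, not part of CVE‑Bench itself: the record layout `(cve_id, condition, solved)`, the placeholder IDs and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical result records: (cve_id, condition, solved).
# Illustrative values only, not benchmark data.
results = [
    ("CVE-A", "advisory", True),
    ("CVE-A", "diagnose", True),
    ("CVE-A", "locate", False),
    ("CVE-B", "advisory", True),
    ("CVE-B", "diagnose", False),
    ("CVE-B", "locate", False),
]

def solve_rates(records):
    """Return {condition: fraction solved}, plus an 'overall' entry."""
    solved = defaultdict(int)
    total = defaultdict(int)
    for _cve, condition, ok in records:
        # Count each run once for its condition and once for the overall rate.
        for key in (condition, "overall"):
            total[key] += 1
            solved[key] += int(ok)
    return {key: solved[key] / total[key] for key in total}

rates = solve_rates(results)
for condition, rate in rates.items():
    print(f"{condition}: {rate:.0%}")
```

A large gap between the advisory rate and the diagnose or locate rates would suggest the model depends on the advisory text rather than on its own analysis of the code.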

Summary

Researchers released CVE‑Bench, a benchmark of 20 real CVEs from 18 Python projects such as Pillow, GitPython, yt‑dlp and urllib3, to evaluate AI code‑repair models.

Five frontier models were tested: three from OpenAI (GPT‑5.5, GPT‑4.5, GPT‑4) and two from Poolside (Laguna and another). Each ran under three prompt conditions: full advisory, diagnose (behavioral description only) and locate (file and function location only).

The corrected evaluation shows GPT‑5.5 achieving a 50 % overall solve rate, rising to 60 % when a full advisory is provided, while the other models lag behind. After five faulty tests were fixed, cross‑family pairwise comparisons reach statistical significance (p ≤ 0.040), but within‑family comparisons remain non‑significant.

Token costs vary by up to a factor of four for equivalent fixes, and three failure modes recur consistently: wrong‑search drift, budget exhaustion and partial fixes. Locate prompts, which give only a location, are the most challenging; the benchmark presents this condition as mirroring real security researcher workflows. The researchers emphasize that no model reliably fixes all vulnerabilities, but that the data can guide future improvements in AI‑assisted security tooling.
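The summary does not say which statistical test the researchers used. As one plausible way to compare two models on the same set of tasks, the sketch below runs an exact McNemar test (a two-sided binomial test on the tasks where exactly one model succeeded). The function name and the sample outcomes are hypothetical.

```python
from math import comb

def exact_mcnemar(a_solved, b_solved):
    """Two-sided exact McNemar p-value for paired pass/fail outcomes.

    a_solved and b_solved are equal-length lists of booleans, one entry
    per task, for model A and model B respectively.
    """
    # Discordant pairs: tasks solved by exactly one of the two models.
    only_a = sum(1 for x, y in zip(a_solved, b_solved) if x and not y)
    only_b = sum(1 for x, y in zip(a_solved, b_solved) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no disagreement, so no evidence of a difference
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical outcomes on 10 tasks: A solves six that B misses.
model_a = [True] * 6 + [False] * 4
model_b = [False] * 10
p_value = exact_mcnemar(model_a, model_b)
print(f"p = {p_value:.5f}")
```

With only 20 CVEs per comparison, a paired test of this kind needs a fairly lopsided split of discordant tasks before it reaches significance, which is consistent with within‑family differences staying non‑significant.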

Key changes

CVE‑Bench benchmark created with 20 real CVEs across 18 Python projects
Three prompt types: advisory, diagnose, locate
Five models tested: GPT‑5.5, GPT‑4.5, GPT‑4, Poolside Laguna and a second Poolside model
GPT‑5.5 solve rate 50 % overall, 60 % with full advisory
Cross‑family pairwise comparisons now significant at p ≤ 0.040 after test corrections; within‑family comparisons remain non‑significant
Token cost varies by up to 4× for equivalent fixes
Failure modes identified: wrong‑search drift, budget exhaustion, partial fixes
All numbers and statistical conclusions updated after correcting tests
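
The token-cost spread noted above can be checked on your own runs with a small tally. This is a sketch under stated assumptions: the `(model, tokens, solved)` record layout, the model names and the numbers are hypothetical, and "tokens per solved task" is one possible cost metric, not necessarily the one the researchers report.

```python
def tokens_per_solve(runs):
    """Return {model: total tokens spent / number of solved tasks}.

    Models with no solved tasks are omitted, since the ratio is undefined.
    """
    totals = {}
    for model, tokens, solved in runs:
        spent, wins = totals.get(model, (0, 0))
        totals[model] = (spent + tokens, wins + int(solved))
    return {m: spent / wins for m, (spent, wins) in totals.items() if wins}

# Hypothetical runs, illustrative numbers only.
runs = [
    ("model_x", 40_000, True),
    ("model_x", 20_000, False),
    ("model_y", 10_000, True),
    ("model_y", 5_000, True),
]

costs = tokens_per_solve(runs)
spread = max(costs.values()) / min(costs.values())
print(costs, f"spread = {spread:.1f}x")
```

Counting tokens from failed attempts in the numerator penalizes budget exhaustion and wrong‑search drift, which a plain per-run average would hide.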

Affects

internal
