AI Models Struggle to Fix Real-World CVEs – Benchmark Results Updated
On CVE‑Bench, the best model solves 50 % of tasks overall and 60 % when given the full advisory.
Run CVE‑Bench on your own models to benchmark vulnerability repair, then adjust your prompt strategy based on how they perform under the diagnose and locate conditions.
Summary
Researchers released CVE‑Bench, a benchmark of 20 real CVEs from 18 Python projects such as Pillow, GitPython, yt‑dlp and urllib3, to evaluate AI code‑repair models.
Five frontier models were tested: three from OpenAI (GPT‑5.5, GPT‑4.5, GPT‑4) and two from Poolside (Laguna and another). Each ran under three prompt conditions: full advisory, diagnose (behavioral description only) and locate (file and function location only).
The corrected evaluation shows GPT‑5.5 achieving a 50 % overall solve rate, rising to 60 % when a full advisory is provided, while the other models lag behind. After five faulty tests were fixed, cross‑family pairwise comparisons reach statistical significance (p ≤ 0.040), but within‑family comparisons remain non‑significant.
Token costs vary by up to 4× for equivalent fixes, and three failure modes are consistently observed: wrong‑search drift, budget exhaustion and partial fixes. The benchmark also highlights that locate prompts, which give only a location, are the most challenging, mirroring real security researcher workflows. The researchers emphasize that no model reliably fixes all vulnerabilities, but that the data can guide future improvements in AI‑assisted security tooling.
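To make the three prompt conditions concrete, here is a minimal sketch of how one task might be rendered under each. The `CveTask` fields, the `build_prompt` helper and the prompt wording are all hypothetical; this summary does not describe the benchmark's actual schema or harness.

```python
from dataclasses import dataclass


@dataclass
class CveTask:
    # Hypothetical fields, invented for illustration only.
    cve_id: str
    advisory: str   # full advisory text
    behavior: str   # behavioral description of the flaw, no location
    file_path: str  # file containing the vulnerable code
    function: str   # vulnerable function name


def build_prompt(task: CveTask, condition: str) -> str:
    """Return the prompt for one condition: advisory, diagnose or locate."""
    if condition == "advisory":
        hint = task.advisory
    elif condition == "diagnose":
        hint = task.behavior
    elif condition == "locate":
        hint = f"Look at {task.function} in {task.file_path}."
    else:
        raise ValueError(f"unknown condition: {condition}")
    return f"Fix the vulnerability in this repository.\n{hint}"


# Placeholder task; not a real CVE-Bench entry.
example = CveTask(
    cve_id="CVE-XXXX-NNNN",
    advisory="Full advisory text goes here.",
    behavior="Crafted input makes the parser read past the buffer.",
    file_path="pkg/parser.py",
    function="parse_header",
)

for condition in ("advisory", "diagnose", "locate"):
    print(f"[{condition}]")
    print(build_prompt(example, condition))
```

The locate prompt carries the least information about what is wrong, which is consistent with it being the hardest condition in the reported results.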
Key changes
- CVE‑Bench benchmark created with 20 real CVEs across 18 Python projects
- Three prompt types: advisory, diagnose, locate
- Five models tested: GPT‑5.5, GPT‑4.5, GPT‑4, Poolside Laguna, Poolside other
- GPT‑5.5 solve rate 50 % overall, 60 % with full advisory
- Cross‑family pairwise comparisons now significant at p ≤ 0.040 after five faulty tests were fixed
- Token cost varies 4× for equivalent outcomes
- Failure modes identified: wrong‑search drift, budget exhaustion, partial fixes
- All numbers and statistical conclusions updated after correcting tests
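If you run the benchmark on your own models, solve rates and a paired comparison can be computed along the following lines. This is a sketch under assumptions: the summary does not say which statistical test the researchers used, so an exact McNemar (sign) test on paired pass/fail outcomes is used here as one reasonable choice. The outcome lists are made-up illustrations, not benchmark results.

```python
from math import comb


def solve_rate(outcomes):
    """Fraction of tasks solved; outcomes is a list of booleans."""
    return sum(outcomes) / len(outcomes)


def mcnemar_exact(a, b):
    """Two-sided exact McNemar test on paired pass/fail outcomes.

    Only discordant pairs (one model solved the task, the other did not)
    carry information; under the null they split 50/50.
    """
    only_a = sum(1 for x, y in zip(a, b) if x and not y)
    only_b = sum(1 for x, y in zip(a, b) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative outcomes for two hypothetical models on the same 10 tasks.
model_a = [True] * 8 + [False] * 2
model_b = [False] * 10

print(solve_rate(model_a))              # 0.8
print(mcnemar_exact(model_a, model_b))  # 0.0078125
```

With only 20 CVEs per model, a handful of corrected tests can move a comparison across a significance threshold, so it is worth recomputing p-values whenever the test suite changes.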