Briefing

Engineering Self‑Improving Tax Agents with Codex

ai-dev
OpenAI

Use Codex to build self‑improving agents by capturing practitioner feedback and production traces and feeding both into a Codex‑driven iteration loop.

What to do now

Implement a Codex‑driven iteration loop that captures practitioner corrections as structured findings and generates targeted evals.
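As a minimal sketch of that first step, the Python below records each practitioner correction as a structured finding and derives one regression-style eval case per corrected field. The `Finding` and `EvalCase` schemas, the field names, and the `draft_field` callable are all hypothetical; the briefing does not describe the actual data model used by Tax AI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One practitioner correction as a structured record (hypothetical schema)."""
    return_id: str
    field: str
    drafted_value: str
    corrected_value: str
    note: str = ""


@dataclass(frozen=True)
class EvalCase:
    """A targeted check derived from one finding (hypothetical schema)."""
    return_id: str
    field: str
    expected_value: str


def findings_to_evals(findings):
    """Build one eval case per corrected (return, field) pair.

    Findings where the practitioner left the draft unchanged are skipped.
    If the same field was corrected more than once, the latest correction wins.
    """
    latest = {}
    for f in findings:
        if f.drafted_value == f.corrected_value:
            continue  # draft was confirmed, so there is nothing to test
        latest[(f.return_id, f.field)] = EvalCase(
            f.return_id, f.field, f.corrected_value
        )
    return list(latest.values())


def run_evals(cases, draft_field):
    """Return the cases the agent still gets wrong.

    `draft_field(return_id, field)` stands in for the agent under test.
    """
    return [
        c for c in cases
        if draft_field(c.return_id, c.field) != c.expected_value
    ]
```

The failing cases returned by `run_evals` are the natural input to the next step of the loop: each one names a specific return and field to investigate.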

Summary

Thrive Holdings and OpenAI collaborated to build Tax AI, a self‑improving agent that processes tax returns for a network of 30+ accounting firms.

The system captured practitioner feedback and production traces, then used Codex to generate targeted evaluations and engineering tasks, creating a continuous improvement loop. In the pilot, Tax AI processed 7,000 returns, increased throughput by 50%, and reached 97% draft accuracy. Accuracy at the 75% correct‑field‑completion threshold rose from 25% at launch to 86% within six weeks, and the system handled increasingly complex filings such as K‑1s and rental schedules. The architecture records the full path from source documents to final submission, which makes it possible to trace an error back to the step that caused it. Codex’s agentic capabilities allowed the system to investigate failures and propose fixes autonomously, reducing manual engineering effort. The project illustrates how Codex can be used to build self‑improving agents for real‑world tax preparation workflows.
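To illustrate why recording the full path helps with root-cause analysis, here is a small Python sketch that stores the value of one field after each pipeline stage and reports the first stage whose output disagrees with the practitioner's final value. The stage names and the `TraceStep` structure are assumptions made for illustration, and comparing raw values stage by stage is a deliberate simplification, not the method described in the source.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TraceStep:
    """Value of one field after one pipeline stage (stage names are hypothetical)."""
    stage: str
    value: str


def first_divergence(steps: Sequence[TraceStep], final_value: str) -> Optional[str]:
    """Return the first stage whose value differs from the submitted value.

    Returns None when every recorded stage already matched the final value.
    """
    for step in steps:
        if step.value != final_value:
            return step.stage
    return None


# Example trace: the document was read correctly, but the draft lost the amount.
trace = [
    TraceStep("extraction", "1200"),
    TraceStep("mapping", "1200"),
    TraceStep("draft", "0"),
]
suspect_stage = first_divergence(trace, final_value="1200")
```

In this example `suspect_stage` is `"draft"`, which narrows the investigation to the drafting step rather than document extraction.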

Key changes

  • Processed 7,000 tax returns in pilot, increasing throughput by 50%.
  • Accuracy at the 75% correct‑field‑completion threshold rose from 25% at launch to 86% within six weeks.
  • Draft accuracy reached 97% with minimal practitioner correction.
  • Captured practitioner feedback and production traces, and used Codex to generate targeted evals.
  • Codex‑driven iteration loop automatically investigated failures and proposed fixes.
  • System handled increasingly complex filings (K‑1s, rental schedules).
  • Reduced manual engineering effort by converting corrections into structured findings.
  • Demonstrated a self‑improving agent in real‑world tax workflows.
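The briefing does not define how the threshold accuracy figure is computed. One plausible reading, assumed here purely for illustration, is the share of returns in which at least 75% of fields were completed correctly. The function name and the input shape below are hypothetical.

```python
def share_meeting_threshold(returns, threshold=0.75):
    """Fraction of returns whose correct-field ratio is at least `threshold`.

    `returns` is a list of (correct_fields, total_fields) pairs, one per return.
    This is an assumed definition of the metric, not one given in the source.
    """
    if not returns:
        raise ValueError("at least one return is required")
    meeting = sum(
        1
        for correct, total in returns
        if total > 0 and correct / total >= threshold
    )
    return meeting / len(returns)


# Four returns with 4 fields each: 3, 1, 4, and 2 fields correct.
sample = [(3, 4), (1, 4), (4, 4), (2, 4)]
rate = share_meeting_threshold(sample)
```

Here two of the four sample returns reach the threshold, so `rate` is 0.5. Under this reading, the pilot's reported change from 25% to 86% would describe the share of returns clearing the bar, not the average per-field accuracy.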

Affects

internal
