AI still cannot beat the on-call engineer. The ARFBench benchmark draws on 63 real production incidents and more than 5.38 million data points to measure model performance on incident response tasks. GPT-5 scored 62.7% accuracy on the benchmark's Tier I–III questions, a figure that is central to discussions of AI versus human engineers.
The ARFBench benchmark covers 63 real production incidents and includes 750 multiple-choice questions, 142 distinct monitoring metrics, and more than 5.38 million data points. These data points were derived from engineers’ Slack threads captured during live emergencies, and the benchmark contains no synthetic data or textbook scenarios. The dataset emphasizes real-world observability signals and human troubleshooting interactions rather than simulated incidents.
The benchmark’s questions span Tier I–III tasks designed to reflect the kinds of time-series and cross-metric reasoning engineers perform during outages. ARFBench was developed as a joint project between Datadog and Carnegie Mellon. The design aggregates monitoring metrics and human conversational context to form multiple-choice evaluation items. The benchmark’s construction intentionally avoided synthetic augmentation and hypothetical cases.
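To make that structure concrete, here is a minimal sketch of how a multiple-choice item that pairs monitoring metrics with a tier label might be represented and scored. The `BenchmarkItem` class, its field names, and the `accuracy_by_tier` helper are illustrative assumptions, not the actual ARFBench schema or tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical shape of one multiple-choice question (not the real schema)."""
    question_id: str
    tier: int                      # 1, 2, or 3 (Tier I-III)
    metric_names: Tuple[str, ...]  # monitoring metrics the question refers to
    choices: Tuple[str, ...]       # answer options shown to the model or human
    answer_index: int              # index of the correct option in `choices`


def accuracy_by_tier(
    items: Iterable[BenchmarkItem],
    predictions: Dict[str, int],
) -> Dict[int, float]:
    """Return the fraction of correctly answered questions for each tier.

    `predictions` maps question_id to the chosen option index; a missing
    prediction counts as a wrong answer.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.tier] += 1
        if predictions.get(item.question_id) == item.answer_index:
            correct[item.tier] += 1
    return {tier: correct[tier] / total[tier] for tier in total}
```

Splitting accuracy by tier, rather than reporting a single number, is one simple way to see whether errors concentrate in a particular tier of question.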
ARFBench provides a structured, empirically grounded dataset for evaluating AI models against human engineers in incident response. The collaboration between Datadog and Carnegie Mellon produced a reproducible, real-world benchmark for comparing model and human performance.
The ARFBench results list accuracy scores for multiple AI models on Tier I–III tasks. GPT-5 scored 62.7%, Gemini 3 Pro 58.1%, Claude Opus 4.6 54.8%, and Claude Sonnet 4.5 47.2%. Hybrid entries included Toto-1.0-QA-Experimental at 63.9% and a Toto+Qwen3-VL 32B combination that edged GPT-5. For reference, random guessing on the benchmark would yield 24.5%.
Human baselines on the same benchmark were higher than every AI model listed. Domain experts scored 72.7% accuracy and non-domain experts scored 69.7%. The benchmark report states that no AI model beat either human baseline. These figures appear in the ARFBench leaderboard and summary of results, where Toto-1.0-QA-Experimental slightly outscored GPT-5.
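The gaps implied by these figures are simple arithmetic, sketched below. The scores are the ones quoted in this article. The `gap_report` function and its output field names are hypothetical helpers for illustration, not part of ARFBench. The Toto+Qwen3-VL 32B entry is omitted because no exact score is given for it here.

```python
# Accuracy percentages as quoted in the article (Tier I-III tasks).
RANDOM_BASELINE = 24.5
HUMAN_BASELINES = {"domain experts": 72.7, "non-domain experts": 69.7}
MODEL_SCORES = {
    "Toto-1.0-QA-Experimental": 63.9,
    "GPT-5": 62.7,
    "Gemini 3 Pro": 58.1,
    "Claude Opus 4.6": 54.8,
    "Claude Sonnet 4.5": 47.2,
}


def gap_report(scores, humans, chance):
    """Rank models by accuracy and compute each one's distance, in
    percentage points, from random guessing and from the weaker of the
    human baselines."""
    lowest_human = min(humans.values())
    rows = []
    for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        rows.append({
            "model": name,
            "accuracy": acc,
            "above_chance": round(acc - chance, 1),
            "below_lowest_human": round(lowest_human - acc, 1),
        })
    return rows


if __name__ == "__main__":
    for row in gap_report(MODEL_SCORES, HUMAN_BASELINES, RANDOM_BASELINE):
        print(row)
```

On these numbers, even the top-ranked model trails the non-domain-expert baseline by 5.8 percentage points (69.7 minus 63.9), while every model sits well above the 24.5% random-guessing level.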
Tier III questions in ARFBench require reasoning across multiple monitoring metrics and time-series signals at once, and the benchmark reports that AI models struggle with them. ARFBench highlights these cross-metric items as a key factor in evaluating model performance on realistic incident response tasks.
ARFBench results show that no AI model surpassed human engineers’ accuracy on incident response tasks. The report states, “A purpose-built domain model, trained on observability data, outperforming a frontier general-purpose system at this specific task is the expected outcome. That’s the point.” It also notes, “The most valuable finding isn’t which model scored highest.”


