The Center for AI Standards and Innovation (CAISI), a unit of NIST, evaluated DeepSeek V4 Pro on May 1 and concluded that it lags frontier AI models by about 8 months. Among the models evaluated, GPT-5.5 achieved the highest IRT-estimated Elo score, 1,260, followed by Claude Opus 4.6 at 999. DeepSeek V4 Pro recorded an Elo score of approximately 800, with a margin of ±28, while GPT-5.4 mini scored 749.
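To put the reported Elo figures side by side, the short sketch below computes each model's gap to DeepSeek V4 Pro. It also converts each gap into an expected score using the conventional 400-point logistic Elo formula. That conversion is an assumption made for illustration: the text does not say how CAISI scales its IRT-estimated Elo, so the probabilities may not match CAISI's own interpretation. The names `ELO`, `expected_score`, and `gaps_to` are hypothetical helpers, not part of any CAISI tooling.

```python
# Elo values as reported in the text. DeepSeek V4 Pro's figure is approximate (±28).
ELO = {
    "GPT-5.5": 1260,
    "Claude Opus 4.6": 999,
    "DeepSeek V4 Pro": 800,
    "GPT-5.4 mini": 749,
}


def expected_score(rating_a, rating_b):
    """Expected score of A against B under the conventional Elo formula.

    Assumption: a 400-point logistic scale, which CAISI's IRT-estimated
    Elo may or may not follow.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def gaps_to(reference, ratings=ELO):
    """Elo difference of every other model relative to `reference`."""
    base = ratings[reference]
    return {name: r - base for name, r in ratings.items() if name != reference}


if __name__ == "__main__":
    ref = "DeepSeek V4 Pro"
    for name, gap in gaps_to(ref).items():
        p = expected_score(ELO[ref], ELO[name])
        print(f"{name}: gap {gap:+d} Elo, expected score for {ref} = {p:.2f}")
```

Because DeepSeek V4 Pro's score carries a ±28 margin, the gaps to its nearest neighbors, especially GPT-5.4 mini, are best read as approximate.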
On public benchmarks, DeepSeek V4 Pro scored 90% on GPQA-Diamond, compared with Claude Opus 4.6’s 91%. On the math benchmarks OTIS-AIME-2025, PUMaC 2024, and SMT 2025, it scored 97%, 96%, and 96%, respectively. On SWE-Bench Verified, however, DeepSeek V4 Pro scored 74%, below GPT-5.5’s 81%. Taken together, these results show DeepSeek V4 Pro performing strongly on some benchmarks while falling short of top-tier AI models on others.
In the cost comparison, only GPT-5.4 mini cleared the cost bar. However, DeepSeek V4 Pro was more cost-effective on five of the seven benchmarks evaluated, giving it a pricing edge over other prominent AI models.
In its technical report, DeepSeek claims that V4 Pro performs comparably to both Claude Opus 4.6 and GPT-5.4, in contrast to the lag reported by CAISI. Ex0bit, a representative of DeepSeek, strongly denied that DeepSeek V4 Pro is 8 months behind: “There’s no ‘gap’, and no one’s 8 months behind. We’ve been trolled on every closed U.S drop and flexed on with open weights.” The statement signals skepticism toward CAISI’s methodology and insists that DeepSeek V4 Pro is competitive.
On May 1, the Center for AI Standards and Innovation (CAISI), a unit of NIST, evaluated DeepSeek V4 Pro and concluded it lags the frontier by about eight months, reflected in IRT-estimated Elo scores: GPT-5.5 1,260; Claude Opus 4.6 999; DeepSeek V4 Pro approximately 800 (±28); and GPT-5.4 mini 749.
On public benchmarks, DeepSeek scored 90% on GPQA-Diamond (Opus 4.6 91%), 97% on OTIS-AIME-2025, 96% on PUMaC 2024, 96% on SMT 2025, and 74% on SWE-Bench Verified (GPT-5.5 81%).
DeepSeek’s technical report asserts V4 Pro matches Opus 4.6 and GPT-5.4, and Ex0bit rejected the assessment’s 8-month lag finding.


