The article examines how frontier AI models fare against the sports betting market on KellyBench, a benchmark spanning the full 2023–24 English Premier League season, and reports that every model tested lost money. Eight top models were evaluated, including Claude, Grok (notably Grok 4.20), Gemini (including Gemini Flash and Gemini 3.1 Pro), and GPT-5.4. All finished their runs at a loss; some runs ended in bankruptcy or forfeiture, and the researchers observed that models could often articulate a sound strategy yet failed to execute it profitably.
KellyBench is named after the Kelly criterion, the classic formula for sizing bets as a fraction of bankroll. The benchmark is structured around 120 matchdays of English Premier League fixtures covering an entire season, placing agents in a market environment where odds and conditions shift continuously.
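For background, the Kelly criterion stakes the fraction of bankroll that maximises long-run logarithmic growth. A minimal sketch of the formula (this is the textbook criterion, not KellyBench's actual staking interface, which the article does not describe):

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Bankroll fraction to stake under the Kelly criterion.

    p: the bettor's estimated probability of the bet winning
    decimal_odds: payout per unit staked, stake included (e.g. 2.5)
    Returns 0.0 when the edge is non-positive, i.e. no bet.
    """
    b = decimal_odds - 1.0          # net odds received on a win
    edge = b * p - (1.0 - p)        # expected profit per unit staked
    return max(edge / b, 0.0)

# Example: a 50% shot offered at decimal odds of 2.5
stake = kelly_fraction(0.5, 2.5)   # (1.5*0.5 - 0.5)/1.5 ≈ 0.167
```

Note that the criterion is only optimal when `p` is accurate; overestimating the edge leads to overbetting, which is one plausible route to the bankruptcies reported below.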
KellyBench requires agents to maintain coherent intent across potentially thousands of sequential decisions, to monitor the consequences of those decisions, and to close the loop between observation and action. The benchmark thus measures sustained decision coherence rather than isolated predictions.
The constantly shifting market also introduces non-stationarity: agents must adapt their estimates as conditions change while preserving a coherent, long-horizon strategy.
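The observe-decide-act loop the benchmark demands can be illustrated with a toy simulation. Everything here is hypothetical: the odds, probabilities, bookmaker margin, and starting bankroll are invented for illustration and are not KellyBench data.

```python
import random

def run_season(stake_fraction: float, n_bets: int = 380, seed: int = 0) -> float:
    """Toy sequential-betting loop: stake a fixed fraction of the
    current bankroll on each simulated fixture. Illustrates how a
    season of compounding decisions punishes overbetting.
    """
    rng = random.Random(seed)
    bankroll = 100_000.0
    for _ in range(n_bets):
        p_true = rng.uniform(0.3, 0.6)       # simulated win probability
        decimal_odds = 0.95 / p_true         # bookmaker margin baked in
        stake = bankroll * stake_fraction
        if rng.random() < p_true:
            bankroll += stake * (decimal_odds - 1.0)   # win
        else:
            bankroll -= stake                           # loss
        if bankroll <= 0.0:
            return 0.0                       # bankrupt, run over
    return bankroll
```

Because the simulated odds carry a margin, no fixed fraction is profitable in expectation here; the sketch only shows the loop structure, not a winning strategy.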
Grok 4.20 went bankrupt in one run. Gemini Flash forfeited two of its three runs after placing a single wager of roughly £273,000 and losing it. Claude Opus 4.6 averaged an 11 percent loss across its runs. Dixon-Coles, an aging baseline from the early 2000s, nonetheless finished ahead of six of the eight evaluated models.
Researchers attribute these results to a “knowledge-action gap”: the models could articulate strategies but failed to execute them. As they put it, “KellyBench requires agents to maintain coherent intent across potentially thousands of sequential decisions, monitor the consequences of those decisions, and close the loop between observation and action.” They also noted that many models failed to handle the market’s persistent non-stationarity.
The Dixon-Coles result stood out. The researchers describe it as “an outdated 2000s baseline which doesn’t utilise all available data or account for non-stationarity in a principled way.” Despite those limitations, it beat several frontier models: “It is therefore even more surprising that many frontier models, such as Gemini 3.1 Pro, are unable to beat or match it on KellyBench.”
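At its core, the published Dixon-Coles approach models each side’s goals as Poisson-distributed given attack and defence strengths, with a small correction for low-scoring dependence. A minimal sketch of the Poisson core (omitting the low-score correction and the parameter fitting; the goal means below are illustrative, not fitted values):

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def match_probs(mu_home: float, mu_away: float, max_goals: int = 10):
    """Home-win, draw, and away-win probabilities from independent
    Poisson goal counts, truncated at max_goals per side."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, mu_home) * poisson_pmf(a, mu_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    return home, draw, away

# Example: expected goals of 1.4 (home) vs 1.1 (away)
probs = match_probs(1.4, 1.1)
```

Even this stripped-down Poisson machinery yields calibrated outcome probabilities, which may be why the full Dixon-Coles baseline remains hard to beat despite its age.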
In sum, frontier AI models evaluated on KellyBench across the 2023–24 English Premier League season failed to bet profitably. They could articulate strategy but could not execute it, repeatedly failing over a full season to close the loop between observation and action.


