A Stanford-led study found that AI outperformed law professors in legal reasoning, with AI-generated answers preferred over those of human instructors in head-to-head assessments. The study reported 2,918 blinded comparisons between AI models and human instructors, involved 16 professors from 14 U.S. law schools, and used 40 contract law questions covering legal doctrine, case law, hypotheticals, and policy issues.
Two high-performing models were Google Gemini 2.5 Pro, which won 75.92% of its matchups against human instructors, and NotebookLM, which won 74.75%. AI models outperformed human instructors on recall questions, hypotheticals, and policy discussions.
Among the additional models evaluated, Anthropic’s Claude Opus 4.7 ranked first, followed by OpenAI’s ChatGPT 5.4 and Gemini 2.5 Pro.
The study cautioned that it did not measure alignment with any individual professor’s teaching preferences. It stated that AI responses may be generally acceptable rather than tailored to an individual instructor. Those caveats were reported alongside the numerical performance and ranking results.
The study’s experimental design rested on 2,918 blinded, head-to-head comparisons between AI models and human instructors, evaluated by 16 professors from 14 U.S. law schools. Researchers created the 40 contract law questions, which covered legal doctrine, case law, hypotheticals, and policy issues. Each comparison presented AI-generated and human-written answers in a blinded format for professor evaluation, and the study recorded the outcome of each assessment.
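A win rate of the kind reported from these head-to-head tallies is simple arithmetic: the number of comparisons a model's answer won, divided by the number of comparisons it took part in. The following is a minimal sketch of that calculation, not the study's own code. The record layout, the function name `win_rates`, and the model names are assumptions made for illustration, and the sketch assumes every comparison has a single winner (no ties).

```python
from collections import Counter


def win_rates(comparisons):
    """Return each AI model's share of comparisons won against a human answer.

    `comparisons` is an iterable of (model_name, winner) pairs, where
    `winner` is "ai" or "human". This record layout is hypothetical.
    """
    wins = Counter()
    totals = Counter()
    for model, winner in comparisons:
        totals[model] += 1
        if winner == "ai":
            wins[model] += 1
    return {model: wins[model] / totals[model] for model in totals}


# Toy records for illustration only; these are NOT data from the study.
toy = [
    ("model_a", "ai"),
    ("model_a", "ai"),
    ("model_a", "human"),
    ("model_a", "ai"),
    ("model_b", "human"),
    ("model_b", "ai"),
]

rates = win_rates(toy)
for model, rate in sorted(rates.items()):
    print(f"{model}: {rate:.2%}")
```

With the toy records, `model_a` wins three of four comparisons and `model_b` one of two. A real tally would also need a rule for ties or skipped comparisons, which this article does not describe.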
To probe surface-level writing factors, the study engineered a set of lexico-syntactic features and measured their association with preference patterns. The features covered answer length, structural organization, reasoning nuance, legal anchors, confidence tone, clarity, and pedagogical support. The analysis tested how much of the preference pattern these features could explain relative to substantive content.
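To make the idea of a surface-level feature concrete, the sketch below computes three crude proxies (word count for answer length, paragraph count for structural organization, and a count of citation-like markers for legal anchors) and checks how often the preferred answer in a pair scores higher on one of them. Everything here is an assumption for illustration: the article does not give the study's feature definitions or its statistical method, and the regular expression, function names, and sample answers are hypothetical.

```python
import re

# Hypothetical markers of "legal anchors"; the study's actual feature
# definitions are not given in the article.
LEGAL_ANCHOR_PATTERN = re.compile(r"§|\bv\.|\bRestatement\b|\bUCC\b")


def surface_features(answer: str) -> dict:
    """Compute a few crude surface-level features of an answer."""
    paragraphs = [p for p in re.split(r"\n\s*\n", answer) if p.strip()]
    return {
        "word_count": len(answer.split()),
        "paragraph_count": len(paragraphs),
        "legal_anchor_count": len(LEGAL_ANCHOR_PATTERN.findall(answer)),
    }


def preferred_has_more(pairs, feature):
    """Share of pairs where the preferred answer scores higher on `feature`.

    `pairs` is a list of (preferred_answer, other_answer) strings.
    Pairs that tie on the feature are skipped.
    """
    higher = 0
    counted = 0
    for preferred, other in pairs:
        a = surface_features(preferred)[feature]
        b = surface_features(other)[feature]
        if a == b:
            continue
        counted += 1
        if a > b:
            higher += 1
    return higher / counted if counted else float("nan")


# Invented answers for illustration only; not taken from the study.
pairs = [
    (
        "Under UCC § 2-207, additional terms become proposals.\n\n"
        "The Restatement takes a similar view.",
        "The terms are probably fine.",
    ),
    ("Yes.", "Consideration requires a bargained-for exchange."),
]

print(preferred_has_more(pairs, "word_count"))
print(preferred_has_more(pairs, "legal_anchor_count"))
```

A share near 0.5 for a feature would suggest it says little about which answer was preferred, while a share far from 0.5 would suggest an association. This simple count is only a stand-in for whatever model the study actually used to compare surface features against substantive content.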
The study’s methodology section emphasized this blinded, comparative design and noted that the feature analysis aimed to separate surface-level writing style from substantive content.
The study also reported harmfulness rates for AI-generated and human-written answers: Google Gemini 2.5 Pro recorded a 3.41% harmfulness rate, NotebookLM recorded 3.64%, and human instructors recorded 12.06%. These rates were presented alongside the models’ performance in the blinded head-to-head assessments.
Overall, the Stanford-led study found that AI models outperformed law professors in legal reasoning in blinded, head-to-head assessments. It presented those results cautiously, pairing them with its methodology, its lexico-syntactic feature analysis, and its harmfulness comparisons. It also stressed a key limitation: the study did not measure how well AI responses align with any individual professor’s teaching preferences, so those responses may be generally acceptable rather than tailored.


