Microsoft has combined GPT and Claude models in Copilot Researcher Critique, a multi-model deep research system for complex tasks. On the DRACO benchmark, which covers 100 complex research tasks across 10 domains including medicine, law and technology, Copilot with Critique scored 57.4 points, while Anthropic's Claude Opus 4.6 scored 42.7 points. Critique separates generation from evaluation and uses a combination of models from frontier labs, including Anthropic and OpenAI; the combined system outperforms the next best result by nearly 14%.
Copilot Researcher Critique: a multi-model deep research system
Critique separates generation from evaluation and uses a combination of models from frontier labs, including Anthropic and OpenAI. Microsoft presents it as a multi-model deep research system designed for complex research tasks, intended to address a common weakness of current AI research tools: relying on a single model for both generation and evaluation. Models within Critique are assigned distinct operational roles, splitting the work between drafting and reviewing, and this separation is central to the architecture.
“Critique is a new multi model deep research system designed for complex research tasks. It separates generation from evaluation and utilizes a combination of models from Frontier labs, including Anthropic and OpenAI.” “One model leads the generation phase, planning the task, iterating through retrieval, and producing an initial draft, while a second model focuses on review and refinement, acting as an expert reviewer before the final report is produced.”
In the Critique workflow, GPT leads the generation phase: it plans the task, iterates through retrieval, and produces an initial draft. Claude then serves as the editor in the review phase, acting as an expert reviewer that focuses on factual accuracy and assembles citations to support the draft before the final report is completed.
The design addresses a basic problem in current AI research tools, where a single model performs both generation and evaluation. By assigning distinct roles, Critique splits drafting and review between models in a sequential workflow that ends with a final report produced after the review phase: generation and retrieval sit with GPT, while editorial verification and citation work sit with Claude.
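The generate-then-review split described above can be sketched as a simple sequential pipeline. This is an illustrative sketch only: the `call_model` helper, the model names, and the prompts are assumptions, not Microsoft's actual implementation.

```python
# Hypothetical sketch of a two-phase generate-then-review pipeline.
# call_model is a stand-in for any chat-completion API; the model
# names and prompts are illustrative, not Microsoft's actual setup.

def call_model(model: str, prompt: str) -> str:
    # Placeholder: in practice this would call a provider's API
    # (e.g. an OpenAI or Anthropic client) and return the completion.
    return f"[{model} output for: {prompt[:40]}...]"

def research(task: str) -> str:
    # Phase 1: the generator model plans the task, iterates through
    # retrieval, and produces an initial draft.
    plan = call_model("generator-gpt", f"Plan research steps for: {task}")
    draft = call_model("generator-gpt", f"Draft a report following: {plan}")

    # Phase 2: a second model acts as an expert reviewer, checking
    # factual accuracy and citations before the final report.
    review = call_model(
        "reviewer-claude", f"Review for accuracy and citations: {draft}"
    )

    # The generator revises the draft per the review to produce
    # the final report.
    return call_model("generator-gpt", f"Revise the draft per review: {review}")
```

The key design point is that evaluation is never delegated to the model that produced the draft; the reviewer sees the draft only as input and has no stake in its original wording.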
Copilot Critique's performance was evaluated on the DRACO benchmark, which covers 100 complex research tasks across 10 domains, including medicine, law and technology. That scope of 10 domains and 100 tasks is the evaluation context the article cites for the system.
Microsoft announced both Critique and Council for Copilot's Researcher, presenting Critique as a new multi-model deep research system designed for complex research tasks. The article, which appears under the Markets category, names both products among Microsoft's announced items.


