When AI Solves Math Like a Prodigy: DeepMind vs OpenAI at the Olympiad

Artificial intelligence has leapt from pattern recognition to genuine mathematical reasoning, challenging even the brightest high-school minds. At this year’s International Mathematical Olympiad (IMO), rival systems from DeepMind and OpenAI both tackled the same six fiendishly difficult problems, yet their approaches, results, and broader implications reveal important distinctions.

The New Olympians: Tech Titans Enter the Arena

For decades, the IMO has represented the pinnacle of pre-college mathematical prowess: six problems across two 4.5-hour sessions, with no calculators and no rewrites. This year in Australia, DeepMind’s advanced Gemini Deep Think answered five of the six problems correctly in a single natural-language pass, scoring 35 of 42 points, enough for a gold medal. Two days earlier, OpenAI had quietly disclosed an experimental system matching that result, reporting five solved problems in unofficial demonstration runs.

Unlike earlier efforts, both systems needed no formal theorem-prover translations—the questions were fed in plain English, and answers emerged conversationally. Yet their pipelines differed substantially:

  • DeepMind committed to a one-shot paradigm, mimicking contest constraints with no retries or human-curated hints.
  • OpenAI employed iterative prompting, consensus sampling, and occasional prompt rewrites behind the scenes, yielding glossy final solutions without publicly detailing time limits or retry allowances. Fields Medalist Terence Tao criticized this strategy as “not the IMO either,” since it permits cherry-picking and collaboration that real contestants cannot use (a sketch contrasting the two protocols follows this list).
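Neither lab has published its contest pipeline, so the snippet below is only a minimal sketch of the two protocols described above; the generate and score helpers are hypothetical stand-ins for whatever model call and learned re-ranker each lab actually uses.

```python
from typing import Callable, List

# Hypothetical stand-ins; neither lab has released its actual interfaces.
Generate = Callable[[str], str]      # problem statement -> candidate solution
Score = Callable[[str, str], float]  # (problem, solution) -> re-ranker score


def solve_one_shot(problem: str, generate: Generate) -> str:
    """Contest-style protocol (DeepMind's stated setup): one pass, no retries."""
    return generate(problem)


def solve_with_reranking(problem: str,
                         generate: Generate,
                         score: Score,
                         n_samples: int = 64) -> str:
    """Lab-style protocol (closer to OpenAI's described setup):
    sample many candidate solutions, then keep the highest-scoring one."""
    candidates: List[str] = [generate(problem) for _ in range(n_samples)]
    return max(candidates, key=lambda sol: score(problem, sol))
```

The second protocol's reported accuracy is inseparable from its sampling budget and the quality of the re-ranker, which is the crux of Tao's objection to direct comparisons with human contestants.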

DeepMind’s Gold: Last Year’s Silver, Polished

DeepMind first reached a silver-medal standard in 2024, when an earlier system solved four of the six problems (still relying on formal, machine-checked proofs), and has progressively refined its architecture since, most recently incorporating advanced scratch-pad reasoning and model-based reflection. The latest Gemini Deep Think version elevated performance to gold, achieving 83% accuracy on American Invitational Mathematics Examination (AIME)-style questions and a full five-problem solve under IMO-like constraints.

Real-world impact extends beyond medals: DeepMind’s approach emphasizes
  • Strict evaluation fidelity, preserving time and tool restrictions,
  • Unified natural-language reasoning, avoiding reliance on proof assistant translations, and
  • Scalability of reflection techniques, enabling the model to critique its own solutions step by step.

This rigor means DeepMind’s results most closely mirror human conditions, offering mathematicians a reliable “sparring partner” that can propose plausible solution outlines in real time.
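DeepMind has not released the details of its reflection mechanism, so the loop below is only an illustrative sketch of step-by-step self-critique, under the assumption that it resembles draft-critique-revise prompting; draft, critique, and revise are hypothetical helpers, not a published API.

```python
from typing import Callable

# Hypothetical helpers; the actual reflection prompts are not public.
Draft = Callable[[str], str]             # problem -> candidate proof
Critique = Callable[[str, str], str]     # (problem, proof) -> flaws found, or "OK"
Revise = Callable[[str, str, str], str]  # (problem, proof, flaws) -> improved proof


def reflect_and_solve(problem: str, draft: Draft, critique: Critique,
                      revise: Revise, max_rounds: int = 3) -> str:
    """Sketch of stepwise self-critique: draft a proof, ask the model to find
    flaws in its own argument, revise, and stop once no flaws remain."""
    proof = draft(problem)
    for _ in range(max_rounds):
        flaws = critique(problem, proof)
        if flaws.strip() == "OK":  # the critic found nothing left to fix
            break
        proof = revise(problem, proof, flaws)
    return proof
```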

OpenAI’s Lab-Wrought Champion

OpenAI’s system, an evolution of its GPT-4-derived reasoning models guided by intensive sampling and learned re-ranking, also reached a five-problem solve. OpenAI’s earlier reported results on the 2024 AIME show how much the pipeline matters: roughly 12% accuracy from a single-sample baseline, rising to 93% with heavy consensus sampling and re-ranking, above the qualification cutoff for the USA Mathematical Olympiad. However, these results hinge on the following (a minimal consensus-sampling sketch appears after the list):

  • Massive sampling budgets,
  • Re-runs and prompt tweaks, and
  • Selective reporting of successes.
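For answer-only exams like the AIME, consensus sampling can be as simple as a majority vote over many sampled answers. The sketch below uses a hypothetical generate_answer helper; it illustrates why scores climb with the sampling budget, and why such numbers reflect the budget as much as single-attempt reasoning.

```python
from collections import Counter
from typing import Callable

# Hypothetical helper: problem statement -> short numeric answer string.
GenerateAnswer = Callable[[str], str]


def consensus_answer(problem: str, generate_answer: GenerateAnswer,
                     n_samples: int = 64) -> str:
    """Majority-vote consensus: sample many short answers and return the most
    common one. Correct answers tend to recur; distinct mistakes usually do not."""
    votes = Counter(generate_answer(problem) for _ in range(n_samples))
    answer, _count = votes.most_common(1)[0]
    return answer
```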

Terence Tao warns that this “lab-optimized” pipeline delivers polished answers but sidesteps the contest’s real-time pressure, making direct comparisons with human contestants misleading.

Real-World Implications and Cautions

Both systems push the frontier of AI-driven mathematical discovery—but practitioners and educators must ask: What do these feats truly represent?

  • Research acceleration: AI co-authors could vet conjectures, propose counterexamples, and draft proof sketches, potentially transforming collaboration in pure mathematics.
  • Educational tools: Intelligent tutors may diagnose where a student errs, offering tailored hints, not just final answers.
  • Fair evaluation: Benchmarking under identical constraints is crucial to measure genuine “reasoning” rather than brute-force trial.

Yet overhype risks conflating optimized demos with general problem-solving prowess. Real Olympiad problems often require creativity, leaps of insight, and novel strategies, qualities that large language models still capture unevenly.

Balancing Optimism with Realism

AI’s performance at the Olympiad is undeniably impressive. DeepMind’s natural-language, single-pass success echoes a true gold medal, while OpenAI’s high-powered consensus methods showcase the virtues of scale and sampling. For readers pondering AI’s next steps:

  • Embrace these models as collaborative tools, not replacements for human ingenuity.
  • Demand transparent benchmarks that disclose time budgets, retry limits, sampling budgets, and evaluation protocols (a minimal disclosure sketch follows this list).
  • Recognize that algorithmic insight still lags behind domain expertise in many open-ended proof tasks.
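One way to make that transparency concrete is to attach a small, machine-readable record of the evaluation conditions to every reported score. The structure below is purely illustrative; the field names are assumptions, not part of any existing benchmark standard.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class EvalProtocol:
    """Illustrative record of the conditions behind a reported benchmark score."""
    time_budget_minutes: Optional[int]  # wall-clock limit per session; None if undisclosed
    attempts_per_problem: int           # independent tries allowed per problem
    samples_per_attempt: int            # sampling budget behind each reported solution
    human_in_the_loop: bool             # any prompt rewrites or hint curation
    selection_rule: str                 # e.g. "first attempt", "majority vote", "best of N"


# Example: a contest-style one-shot run versus a heavily sampled lab pipeline.
contest_style = EvalProtocol(270, 1, 1, False, "first attempt")
lab_style = EvalProtocol(None, 3, 64, True, "best of N by learned re-ranker")

print(asdict(contest_style))
print(asdict(lab_style))
```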

As AI continues to refine its scratch-pad reasoning and self-critique mechanisms, expect richer synergies between humans and machines—where AI offers conjectures and drafts, and mathematicians supply the creative spark. In the high-stakes crucible of the Olympiad, today’s gold-medalist AIs herald a future of co-creative discovery.

Ethical Considerations and Fairness

The rise of AI in competitive mathematics raises pressing ethical questions around fairness, access, and academic integrity. Institutions must guard against misuse—such as students passing off AI-generated proofs as their own—while harnessing these tools to democratize advanced learning. Establishing clear guidelines on AI assistance, citation norms, and permissible use cases will be essential to preserve both credibility and innovation.

The Road Ahead: Beyond Olympiad Problems

While the IMO benchmark illustrates raw problem-solving capability, the true frontier lies in open-ended research: tackling unsolved conjectures, formalizing new theories, and collaborating across disciplines. Future AI models may not only match human champions in timed contests but also drive breakthroughs in number theory, topology, and mathematical physics—blurring the line between student and co-investigator.
