Logical

Intelligence

Logical

Intelligence

Logical

Intelligence

Aleph prover

Aleph
prover

Aleph prover

State-of-the-Art Formal Theorem Prover

Aleph Prover is a tool that automatically produces machine-checked proofs in Lean 4. On PutnamBench, a collection of problems from the Putnam Competition*, Aleph solved 668/672 problems, placing first among all provers on the benchmark.

*The Putnam Competition is one of the world's most challenging mathematics competitions.

Aleph Prover: 668 / 672 (99.4%) problems on PutnamBench

First place on leaderboard as of January 6, 2026
Average proof time: ≈ 6h 5m
Average lemmas per proof: ≈ 18
Average cost: ≈ $68 per solved problem
Maximum cost: $1,012
Stats for new 31 problems below, for previous runs see reports on first 500 problems and 137 additional problems

Previous best on leaderboard: 581 / 672 (as of January 6, 2026)

Aleph Prover:

668 / 672 (99.4%)

problems on PutnamBench

First place on leaderboard as of January 6, 2026
Average proof time: ≈ 6h 5m
Average lemmas per proof: ≈ 18
Average cost: ≈ $68 per solved problem
Maximum cost: $1,012
Stats for new 31 problems below, for previous runs see reports on first 500 problems and 137 additional problems

Previous best on leaderboard:

581 / 672 (as of January 6, 2026)

Notes on This Run

We previously published reports on earlier runs of Aleph Prover: the first run with a $100 per problem limit solved 500 problems, and the second run with a $300 per problem limit solved an additional 137 problems. Those runs already achieved first place on the leaderboard. For this new run, we launched Aleph on the remaining 35 problems that were not solved in previous runs, this time with a $1,000 per problem limit. This resulted in 31 additional problems solved. In the PutnamBench leaderboard, we noted that there were three runs: the original one with a $100 limit, a second one on the remaining problems with a $300 limit, and now a third one with a $1,000 limit. The results of the new run were submitted to the PutnamBench authors for verification on January 4 and published on January 6.

Fixing Formalization Errors

During this run, Aleph fixed formalization errors in 12 more problems while working on their proofs, bringing the total to 27 across all runs. Aleph had access to both the Lean 4 formalization and the original English problem statement. While proving, Aleph identified that the formal statement didn't match the English problem and proposed corrections. We manually verified each fix, and all 12 corrections were accepted into the official PutnamBench dataset. These problems were attempted in earlier runs, but we filtered them out for simplicity. In this run, we thoroughly investigated each problem with statement changes.

This explains the cost patterns in this batch. The 12 problems with fixed formalizations don't include costs from previous attempts. All other problems are re-runs from previous batches, so their costs include $100 from the first run plus $300 from the second run, plus the cost of the current run.

About Lean 4

Lean 4 is one of the most popular languages used for machine-checked theorem proving and is the language most widely used by AI tools for automated proving. Unlike proofs written in human languages, Lean 4 allows you to provide the compiler with a formal statement of a theorem and a mathematical proof, and it will automatically verify that the proof is correct or point out errors.

We created Aleph Prover for formal code verification, to provide machine-checked proofs of correctness for code written in any programming language. It has been fine-tuned specifically for proving properties of code. However, it turned out that it also performs very well in solving mathematical competition problems formulated in Lean 4.

While Aleph Prover isn't publicly available yet, we've evaluated it on PutnamBench, a large collection of Lean 4 translations of Putnam competition problems: Aleph solved 668 of 672 problems on a single attempt per problem, surpassing the best prior result on the public leaderboard as of January 6, 2026.

Putnam Competition

The Putnam Competition (officially the William Lowell Putnam Mathematical Competition) is widely regarded as the most important undergraduate mathematics competition in the United States and Canada. Teams from Harvard and MIT have been among the most successful participants. The authors of PutnamBench emphasize the exceptional difficulty of these problems: "The competition is scored out of 120 points, with the median score typically being zero. The top five individual scorers are named Putnam Fellows, and in recent years regularly include former International Mathematical Olympiad (IMO) medalists."

PutnamBench

A team of researchers from the University of Texas at Austin recently published a paper introducing PutnamBench. The paper describes the manual translation of Putnam competition problems from English into formal languages (including Lean 4). Over many years, statements of 672 problems were translated in this way.

The details about this study can be found at: trishullab.github.io/PutnamBench

Aleph Prover solved 668/672 problems (99.4%) on PutnamBench, placing first among all provers.

Closing the Gap Between Formal and Informal Proofs

In previous versions of our reports (first run, second run), we discussed the gap between solving problems in natural language versus Lean 4. Our new results now reach 99.4% on Lean 4. On this benchmark, this essentially closes the question of the difference between natural language and Lean 4 solutions.

Run Statistics

Across the 31 additional successful proofs from the third run, we summarize the distributions for time to proof, lines of Lean code, Lean-check calls, and cost per problem. For transparency, each problem was launched once but with many internal Lean checks as needed. For statistics on previous runs, see our reports on first 500 problems and 137 additional problems.

Time Statistics

→ Minimum proof time: 1 hour 37 minutes

→ Maximum proof time: 69 hours 29 minutes

→ Average proof time: 25 hours 56 minutes

→ Minimum proof time: 1 hour 37 minutes

→ Maximum proof time: 69 hours 29 minutes

→ Average proof time: 25 hours 56 minutes

Lines of Code

→ Average: 1,887 lines per successful proof

→ Minimum: 45 lines

→ Maximum: 6,391 lines

→ Average: 1,887 lines per successful proof

→ Minimum: 45 lines

→ Maximum: 6,391 lines

Lean Check Calls

Each problem was launched exactly once but with many internal Lean-check calls as needed.

→ Minimum: 111 calls

→ Maximum: 23,892 calls

→ Average: 6,569 calls

→ Minimum: 111 calls

→ Maximum: 23,892 calls

→ Average: 6,569 calls

Lemmas & Theorems

Each successful proof required on average 33 proved theorems or lemmas, with a minimum of 3 and a maximum of 95. In total 1,021 theorems or lemmas were proved.