Aleph Prover is a tool that automatically produces machine-checked proofs in Lean 4. On PutnamBench, a collection of problems from the Putnam Competition*, Aleph solved 500/660 problems, placing first among all provers on the benchmark.
*The Putnam Competition is one of the world's most challenging mathematics competitions.
- Aleph Prover: 500 / 660 problems on PutnamBench
- First place on leaderboard as of November 30, 2025
- Average proof time: ≈ 2h 48m
- Average cost: ≈ $23 per solved problem
- Average lemmas per proof: ≈ 15
- Previous best on leaderboard: 462 / 660
About Lean 4
Lean 4 is one of the most popular languages used for machine-checked theorem proving and is the language most widely used by AI tools for automated proving. Unlike proofs written in human languages, Lean 4 allows you to provide the compiler with a formal statement of a theorem and a mathematical proof, and it will automatically verify that the proof is correct or point out errors.
We created Aleph Prover for formal code verification, to provide machine-checked proofs of correctness for code written in any programming language. It has been fine-tuned specifically for proving properties of code. However, it turned out that it also performs very well in solving mathematical competition problems formulated in Lean 4.
While Aleph Prover isn't publicly available yet, we've evaluated it on PutnamBench, a large collection of Lean 4 translations of Putnam competition problems: Aleph solved 500 of 660 problems on a single attempt per problem, surpassing the best prior result on the public leaderboard as of November 30, 2025.
Putnam Competition
The Putnam Competition (officially the William Lowell Putnam Mathematical Competition) is widely regarded as the most important undergraduate mathematics competition in the United States and Canada. Teams from Harvard and MIT have been among the most successful participants. The authors of PutnamBench emphasize the exceptional difficulty of these problems: "The competition is scored out of 120 points, with the median score typically being zero. The top five individual scorers are named Putnam Fellows, and in recent years regularly include former International Mathematical Olympiad (IMO) medalists."
PutnamBench
A team of researchers from the University of Texas at Austin recently published a paper introducing PutnamBench. The paper describes the manual translation of Putnam competition problems from English into formal languages (including Lean 4). Over many years, statements of 660 problems were translated in this way.
The details about this study can be found at: trishullab.github.io/PutnamBench
Hilbert, created by 4 authors from Apple and 2 from UC San Diego, previously held the best result with 462 solved, which was surpassed on the same benchmark. See the leaderboard for details.
Closing the Gap Between Formal and Informal Proofs
The paper introducing Hilbert Prover highlights a big historical gap: strong reasoning models can solve most PutnamBench problems informally (in natural language), but until recently only a small fraction of those solutions could be turned into fully formal Lean proofs. Hilbert narrows that gap by combining an informal "reasoner" LLM, a specialized Lean‑4 prover, the Lean verifier, and retrieval over Mathlib, boosting formal success on PutnamBench to about 70% of problems.
Aleph Prover pushes even closer to informal performance, formally solving 500 of 660 PutnamBench Lean problems (≈76%). That means formal, machine‑checked proofs are now approaching the coverage of informal solutions that general‑purpose LLMs can produce. This brings us closer to a world where the proofs we trust are as capable as the proofs we can imagine.
Run Statistics
Across the 500 successful proofs, we summarize the distributions for time to proof, lines of Lean code, Lean-check calls, and cost per problem. For transparency, each problem was launched once but with many internal Lean checks as needed.
Time Statistics
- Minimum proof time: 2 minutes 31 seconds
- Maximum proof time: 12 hours 50 minutes
- Average proof time: 2 hours 48 minutes
Lines of Code
- Average: 730.23 lines per successful proof
- Minimum: 7 lines
- Maximum: 3,617 lines
Lean Check Calls
Each problem was launched exactly once but with many internal Lean-check calls as needed.
- Minimum: 4 calls
- Maximum: 37,061 calls
- Average: 1,841 calls
Lemmas & Theorems
Each successful proof required on average 15 proved theorems or lemmas, with a minimum of 1 and a maximum of 62. In total 7,497 theorems or lemmas were proved.
Cost Analysis
We enforced a soft limit of $100 per problem.
- Minimum cost: $0.09
- Maximum cost: $103.13
- Average cost: $23.39
- Total cost: $11,669.37
Cost Correlations
Below are charts showing the relationship between cost and time, as well as cost and final code size.
Cost vs Time Correlation
Cost vs Lines of Code