Mathesis: Towards Formal Theorem Proving from Natural Languages

Yu Xuejun, Jianyuan Zhong, Zijin Feng, Pengyi Zhai, Roozbeh Yousefzadeh, Wei Chong Ng, Haoxiong Liu, Ziyi Shou, Jing Xiong, Yudong Zhou, Claudia Beth Ong, Austen Jeremy Sugiarto, Yaoxi Zhang, Wai Ming Tai, Huan Cao, Dongcai Lu, Jiacheng Sun, Qiang Xu, Shen Xin, Zhenguo Li · June 08, 2025

Summary

Mathesis combines an autoformalizer with reinforcement learning for end-to-end formalization and proof generation, achieving 64% accuracy on MiniF2F and 18% on Gaokao-Formal. Its Mathesis-Autoformalizer uses online reinforcement learning to translate natural-language math problems into formal language, and it surpasses Kimina-Autoformalizer on benchmarks. The work also introduces the Gaokao-Formal benchmark and the LeanScorer evaluation framework, which improve the assessment of end-to-end theorem proving and autoformalization. Related papers discuss automated theorem proving, advances in neural theorem proving, and the use of reinforcement learning, Monte Carlo tree search, and large language models to enhance mathematical reasoning.

Introduction
Background
Overview of autoformalization and its importance in formalizing mathematical proofs
Brief history of autoformalization systems and their limitations
Objective
The goal of Mathesis in addressing the challenges of autoformalization and proof generation
The aim of achieving high accuracy on benchmarks like MiniF2F and Gaokao-Formal
Method
Data Collection
Description of the datasets used for training and testing Mathesis
Importance of diverse and challenging datasets in improving system performance
Data Preprocessing
Techniques used for preparing natural language math problems for autoformalization
The role of preprocessing in enhancing the effectiveness of reinforcement learning
Reinforcement Learning Integration
Explanation of how reinforcement learning is used in Mathesis-Autoformalizer
The process of translating natural language math problems into formal language
The role of online reinforcement learning in improving the system's performance over time
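To make the translation step concrete, here is an illustrative Lean 4 (with Mathlib) formalization of a Gaokao-style problem stated in natural language. The problem, theorem name, and proof tactic are hypothetical examples chosen for this sketch, not taken from the paper:

```lean
-- Natural-language problem (illustrative): "Show that for every real
-- number x, x^2 + 1 ≥ 2x."
-- A hypothetical formalization an autoformalizer might produce:
theorem sq_add_one_ge_two_mul (x : ℝ) : x ^ 2 + 1 ≥ 2 * x := by
  -- follows from (x - 1)^2 ≥ 0
  nlinarith [sq_nonneg (x - 1)]
```

An autoformalizer is judged on producing a faithful formal statement; the proof itself is then the prover's job.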
Benchmarks and Evaluation
Gaokao-Formal Benchmark
Description of the Gaokao-Formal benchmark and its significance
How the benchmark evaluates the system's ability to formalize and prove mathematical theorems
LeanScorer Evaluation Framework
Overview of the LeanScorer framework and its role in assessing end-to-end theorem proving and autoformalization
The importance of a comprehensive evaluation framework in validating system performance
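As a toy illustration of how an autoformalization scorer might combine per-criterion judgments (e.g. whether hypotheses and the goal were translated faithfully) into a single verdict, consider the sketch below. The criterion names, labels, and aggregation rule are all assumptions made for this example; this is not the actual LeanScorer algorithm:

```python
def aggregate(judgments: dict[str, str]) -> float:
    """Map per-criterion labels to a score in [0, 1].

    Hypothetical rule: any 'major' flaw (e.g. a wrong goal) rejects the
    formalization outright; 'minor' flaws attenuate the score
    multiplicatively.
    """
    weights = {"ok": 1.0, "minor": 0.5, "major": 0.0}
    scores = [weights[label] for label in judgments.values()]
    if 0.0 in scores:  # a single major flaw fails the candidate
        return 0.0
    result = 1.0
    for s in scores:  # soft aggregation over remaining criteria
        result *= s
    return result

print(aggregate({"hypotheses": "ok", "goal": "ok"}))      # 1.0
print(aggregate({"hypotheses": "minor", "goal": "ok"}))   # 0.5
print(aggregate({"hypotheses": "ok", "goal": "major"}))   # 0.0
```

The point of such a scheme is that a compiler check alone cannot catch a statement that type-checks but says the wrong thing, so semantic criteria must be scored and combined.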
Contributions and Findings
Automated Theorem Proving
Discussion of advancements in automated theorem proving facilitated by Mathesis
The impact of Mathesis on the field of automated reasoning
Neural Theorem Proving
Insights into the integration of neural networks in theorem proving
The benefits of using neural networks for enhancing mathematical reasoning
Reinforcement Learning, Monte Carlo Tree Search, and Large Language Models
Analysis of how reinforcement learning, Monte Carlo tree search, and large language models contribute to Mathesis
Case studies demonstrating the effectiveness of these techniques in improving system performance
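For reference, proof-search systems that use Monte Carlo tree search typically guide expansion with the standard UCT selection rule (a general technique, not a detail stated in this summary): from state $s$, choose the action

$$a^{*} = \arg\max_{a} \left[ Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right],$$

where $Q(s,a)$ is the estimated value of applying tactic $a$, $N(s)$ and $N(s,a)$ are visit counts, and $c$ balances exploitation against exploration of rarely tried tactics.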
Conclusion
Summary of Achievements
Recap of Mathesis's performance on MiniF2F and Gaokao-Formal benchmarks
The significance of surpassing Kimina-Autoformalizer on benchmarks
Future Directions
Potential areas for further research and development in autoformalization and theorem proving
The role of Mathesis in shaping the future of mathematical reasoning systems
Insights
What are the key innovations of Mathesis-Autoformalizer compared to Kimina-Autoformalizer, particularly in the context of autoformalization benchmarks?
How does Mathesis leverage reinforcement learning to improve the translation of natural language math problems into formal language?
What is the LeanScorer evaluation framework, and how does it contribute to the assessment of end-to-end theorem proving and autoformalization?
What role do large language models and Monte Carlo tree search play in enhancing mathematical reasoning within the context of automated theorem proving?