JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models

Kun Zhou, Beichen Zhang, Jiapeng Wang, Zhipeng Chen, Wayne Xin Zhao, Jing Sha, Zhichao Sheng, Shijin Wang, Ji-Rong Wen·May 23, 2024

Summary

This paper presents JiuZhang3.0, an efficient method for enhancing mathematical reasoning in large language models. Rather than relying on a large, expensive model for data synthesis, it trains a smaller, more affordable model on a dataset distilled from GPT-4, with prompts tailored to different education levels. The process uses gradient-based selection of valuable math-related texts and generates 6 million problems with a limited number of API calls. JiuZhang3.0 outperforms existing methods on 18 evaluation datasets, covering both natural language reasoning and tool manipulation tasks. The code and data will be made publicly available.
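The education-level-tailored prompting mentioned above can be illustrated with a minimal sketch; the level names and template wording below are hypothetical placeholders, not the paper's actual prompt set:

```python
# Hypothetical sketch of education-level-tailored synthesis prompts.
# The levels and template wording are illustrative, not the paper's actual prompts.
EDUCATION_LEVELS = ["primary school", "middle school", "high school", "college"]

PROMPT_TEMPLATE = (
    "You are a math teacher. Based on the following text, write a new "
    "{level}-level math problem and solve it step by step.\n\n"
    "Reference text:\n{text}\n"
)

def build_synthesis_prompt(text: str, level: str) -> str:
    """Fill the template with a seed text and a target education level."""
    if level not in EDUCATION_LEVELS:
        raise ValueError(f"unknown level: {level}")
    return PROMPT_TEMPLATE.format(level=level, text=text)

prompt = build_synthesis_prompt("A train travels 60 km in 1.5 hours.", "middle school")
print(prompt.splitlines()[0])
```

In the paper's pipeline, prompts like these would be paired with selected math-related texts and sent to the synthesis model; varying the level diversifies the difficulty and topic coverage of the generated problems.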

Paper digest

What problem does the paper attempt to solve? Is this a new problem?

The paper addresses the challenge of improving the mathematical reasoning of large language models (LLMs), focusing on how to obtain large amounts of high-quality math training data without the expense of calling a powerful proprietary model for every example. The problem itself is not new: prior work has also trained LLMs on math-related data to strengthen their reasoning. The novelty lies in the approach: distilling GPT-4's data synthesis capability into a small model with a limited number of API calls, so that large-scale training data can then be generated cheaply.
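The gradient-based selection of valuable math-related texts mentioned in the summary can be sketched as an influence-style ranking: score each candidate text by how well its training gradient aligns with a reference gradient computed on target math data, then keep the top-scoring texts. The sketch below is an illustrative approximation with toy vectors, not the paper's exact algorithm:

```python
import math

# Minimal sketch of gradient-based data selection, assuming we already have
# per-example gradient features and a reference gradient computed on a small
# target set (e.g., held-out math problems). Illustrative only.

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_texts(candidate_grads, ref_grad, k):
    """Rank candidate texts by gradient similarity to the target; keep top-k indices."""
    scored = sorted(
        range(len(candidate_grads)),
        key=lambda i: cosine(candidate_grads[i], ref_grad),
        reverse=True,
    )
    return scored[:k]

# Toy example: candidate 0 aligns with the reference, candidate 1 opposes it.
grads = [[1.0, 0.9], [-1.0, -0.8], [0.1, 1.0]]
ref = [1.0, 1.0]
print(select_texts(grads, ref, 2))  # indices of the two most aligned candidates
```

In a real pipeline the gradient features would come from backpropagation through the model on each candidate text; the ranking-and-truncation logic stays the same.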


What scientific hypothesis does this paper seek to validate?

This paper seeks to validate the hypothesis that a small LLM, trained to imitate GPT-4's data synthesis capability with only a limited number of API calls, can generate math problems of sufficient quality and scale to substantially improve the mathematical reasoning of LLMs trained on them. In other words, expensive large-model synthesis is not necessary: a carefully trained small synthesis model, combined with gradient-based selection of valuable math-related texts and prompts tailored to different education levels, can match or surpass prior approaches on both natural language reasoning and tool manipulation tasks.


What new ideas, methods, or models does the paper propose? What are the characteristics and advantages compared to previous methods?

The paper "JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models" proposes a cost-efficient pipeline for synthesizing math training data. Its main new ideas are:

  1. Small Data Synthesis Model: Instead of relying on a large proprietary model for every synthesis call, the paper trains a small open LLM to synthesize math problems, distilling GPT-4's data synthesis capability with a limited number of API calls.

  2. Gradient-Based Text Selection: A gradient-based strategy selects the most valuable math-related texts to serve as seeds for problem synthesis.

  3. Education-Level-Tailored Prompts: Synthesis prompts are tailored to different education levels, broadening the difficulty and topic coverage of the generated problems.

  4. Large-Scale, Low-Cost Generation: Using the trained synthesis model, about 6 million math problems are generated at a fraction of the cost of calling GPT-4 directly.

  5. Dual Capability Training: The synthesized data covers both natural language reasoning and tool manipulation (program-based) solutions, so the resulting JiuZhang3.0 models handle both task types.

The paper also discusses related systems that it builds on or compares against, including tool-integrated reasoning agents (ToRA), open math LLMs (Llemma, InternLM-Math, DeepSeekMath), math pre-training corpora (MathPile), and self-critique pipelines (ChatGLM-Math); these are related work and baselines rather than contributions of this paper.

Compared with previous approaches, the paper highlights several characteristics and advantages of the proposed method:

  1. Tool Manipulation:

    • Characteristics: The JiuZhang3.0 models learn to solve the synthesized math problems by generating programs, which strengthens their tool-use abilities.
    • Advantages: They significantly outperform baseline methods on tool manipulation tasks.
  2. Pre-training Data Quality:

    • Characteristics: The synthetic pre-training data is of high quality, and the model adapts to it better than baseline models do to comparable data.
    • Advantages: Performance consistently surpasses the baselines, especially on complex datasets such as MATH, showing stronger advanced mathematical reasoning.
  3. Variation Studies:

    • Characteristics: Variation studies test the key components of the method, including the prompt set, the selected math-related texts, and the boosting technique.
    • Advantages: The original configuration performs best among all variations, supporting the effectiveness of each component and of the gradient-based selection strategy in particular.
  4. Model Performance:

    • Characteristics: The JiuZhang3.0 models, particularly the 8B version, outperform strong math-specialized models such as DeepSeekMath-7B.
    • Advantages: The 8B version also performs better than the 7B version, especially in code synthesis and tool use.

These characteristics and advantages highlight the innovations of the JiuZhang3.0 approach: efficient training of a small data synthesis model that yields strong, broad mathematical reasoning.
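The tool manipulation setting can be sketched as program-aided solving: the model emits a short Python program and a harness executes it, taking the printed value as the answer. The executor below is a simplified illustration; a real system would sandbox untrusted code:

```python
import io
import contextlib

def run_generated_program(code: str) -> str:
    """Execute model-generated Python and capture what it prints.

    Simplified illustration only: a real harness must sandbox untrusted code.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {"__builtins__": __builtins__})  # no sandboxing here
    return buffer.getvalue().strip()

# A program a model might generate for "What is 15% of 240?"
generated = "ans = 240 * 15 // 100\nprint(ans)"
print(run_generated_program(generated))  # → 36
```

Pairing each synthesized problem with an executable solution like this is what lets the trained models acquire tool manipulation ability alongside natural language reasoning.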


Do any related research studies exist? Who are the noteworthy researchers on this topic in this field? What is the key to the solution mentioned in the paper?

Several related research studies have been conducted in the field of mathematical reasoning and training small data synthesis models. Noteworthy researchers in this area include Long Long Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zheng Li, Adrian Weller, Weiyang Liu, Arindam Mitra, Hamed Khanpour, Corby Rosset, Ahmed Awadallah, Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, Weizhu Chen, Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun, and many others, largely the authors of the related work cited in the paper.

The key to the solution is the training of a small data synthesis model: the paper distills GPT-4's ability to synthesize math problems into a small LLM using only a limited number of API calls, selects valuable math-related texts with a gradient-based strategy, and tailors synthesis prompts to different education levels. The trained model then generates millions of problems cheaply, and pre-training on this synthetic corpus is what improves the mathematical reasoning and problem-solving abilities of the resulting JiuZhang3.0 models.
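The budget-limited distillation from GPT-4 described in the summary can be sketched as collecting (prompt, completion) pairs from the teacher under a fixed API-call budget, then using the pairs to fine-tune the small synthesis model. `call_teacher` is a hypothetical stand-in for the real GPT-4 API, and the budget value is illustrative:

```python
# Sketch of budget-limited distillation: gather fine-tuning pairs from the
# teacher without exceeding a fixed API-call budget. Illustrative only.

API_BUDGET = 5  # hypothetical; the real budget depends on API cost limits

def call_teacher(prompt: str) -> str:
    # Placeholder for a GPT-4 API call.
    return f"[synthesized problem for: {prompt[:20]}]"

def distill(seed_prompts, budget=API_BUDGET):
    """Return fine-tuning pairs, never exceeding the API-call budget."""
    pairs = []
    for prompt in seed_prompts[:budget]:
        pairs.append({"prompt": prompt, "completion": call_teacher(prompt)})
    return pairs

pairs = distill([f"text {i}" for i in range(10)])
print(len(pairs))  # capped by the budget, not by the number of seeds
```

The resulting pairs would then serve as supervised fine-tuning data for the small model, which afterwards synthesizes problems without further teacher calls.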


How were the experiments in the paper designed?

The experiments in the paper were designed with the following key aspects:

  • The evaluation framework and examples followed existing work.
  • For general-domain and math-domain base models, few-shot prompting was adopted.
  • For fine-tuned models, zero-shot prompting was used for open-ended natural language reasoning and tool manipulation tasks, while few-shot prompting was applied for multiple-choice problems.
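The two prompting modes above differ only in whether worked demonstrations are prepended to the question; a minimal sketch (the prompt wording is an illustrative placeholder):

```python
# Sketch of zero-shot vs. few-shot prompting. The demonstration content and
# wording are illustrative, not the paper's actual evaluation prompts.

def build_prompt(question: str, demonstrations=None) -> str:
    """Zero-shot if no demonstrations are given, few-shot otherwise."""
    parts = []
    for demo_q, demo_a in demonstrations or []:
        parts.append(f"Question: {demo_q}\nAnswer: {demo_a}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

demos = [("What is 2 + 3?", "2 + 3 = 5. The answer is 5.")]
few_shot = build_prompt("What is 7 * 8?", demos)  # demonstrations prepended
zero_shot = build_prompt("What is 7 * 8?")        # bare question only
print(few_shot.count("Question:"), zero_shot.count("Question:"))
```

Base models typically need the few-shot form to follow the answer format, while fine-tuned models can answer the zero-shot form directly.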

What is the dataset used for quantitative evaluation? Is the code open source?

Quantitative evaluation is carried out on 18 math-related evaluation datasets covering natural language reasoning and tool manipulation tasks. The math-related texts used as seeds for data synthesis are drawn from sources such as MathPile-Arxiv, the StackExchange subset of the MMIQC dataset, and MathPile-Wikipedia. The code and data are stated to be released publicly, and the study builds on open-source LLMs, making it accessible for others to replicate and extend.
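Evaluation on such datasets is typically accuracy over normalized final answers; a minimal sketch, with illustrative normalization rules rather than any benchmark's official ones:

```python
def normalize_answer(ans: str) -> str:
    """Illustrative normalization: strip whitespace, '$', ',', trailing '.0'."""
    ans = ans.strip().replace("$", "").replace(",", "")
    try:
        num = float(ans)
        return str(int(num)) if num == int(num) else str(num)
    except ValueError:
        return ans  # non-numeric answers compared as-is

def accuracy(predictions, references):
    """Fraction of predictions whose normalized answer matches the gold answer."""
    correct = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

preds = ["36.0", "$5", "x + 1"]
golds = ["36", "5", "x+1"]
print(accuracy(preds, golds))  # two of three match after normalization
```

Real benchmarks use more careful normalization (e.g., LaTeX expression matching on MATH), but the accuracy computation itself looks like this.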


Do the experiments and results in the paper provide good support for the scientific hypotheses that need to be verified? Please analyze.

The experiments and results provide strong support for the paper's hypotheses. The study evaluates the effectiveness of different models and strategies for mathematical reasoning, and the JiuZhang3.0 models significantly outperform baseline methods. In addition, a variation study that used different existing LLMs for data synthesis found that they generally performed worse than the trained synthesis model, especially on complex math problems. This suggests that off-the-shelf LLMs are not well suited to directly synthesizing pre-training data for hard mathematical problem solving.


What are the contributions of this paper?

The paper's main contributions include:

  • An efficient approach that distills GPT-4's data synthesis capability into a small LLM using a limited number of API calls, avoiding the cost of synthesizing all training data with a large proprietary model.
  • A gradient-based strategy for selecting valuable math-related texts as seeds for problem synthesis.
  • Synthesis prompts tailored to different education levels, broadening the difficulty and topic coverage of the generated problems.
  • A synthesized corpus of about 6 million math problems used to train the JiuZhang3.0 models.
  • JiuZhang3.0 models that outperform existing methods on 18 evaluation datasets, covering both natural language reasoning and tool manipulation tasks.
  • Public release of the code and data.

Works such as the Claude 3 model family, Gemini 1.5, ToRA, MetaMath, Orca-Math, and DeepSeek LLM are discussed in the paper as related work or baselines rather than as its contributions.

What work can be continued in depth?

To continue work in depth on improving mathematical reasoning, researchers can focus on several aspects such as:

  • Optimizing prompt engineering: further refining techniques such as chain-of-thought and tree-of-thought prompting to better guide LLMs on mathematical reasoning tasks.
  • Continual pre-training: continued pre-training of LLMs on domain- or task-specific corpora to improve their handling of diverse mathematical problems.
  • Supervised fine-tuning: fine-tuning LLMs on math-related instruction datasets to improve instruction following and, in turn, mathematical reasoning.
  • Exploring other strategies: tool augmentation, decoding optimization, and related techniques to further strengthen LLMs' mathematical reasoning.
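As a small illustration of the chain-of-thought prompting mentioned above, a CoT prompt includes worked intermediate reasoning in the demonstration and then invites step-by-step reasoning for the new question (wording illustrative):

```python
# Chain-of-thought prompting sketch: the demonstration shows intermediate
# reasoning before the final answer, encouraging the model to do the same.
COT_DEMO = (
    "Q: A shop sells pens at 3 dollars each. How much do 4 pens cost?\n"
    "A: Each pen costs 3 dollars. 4 pens cost 4 * 3 = 12 dollars. "
    "The answer is 12.\n"
)

def cot_prompt(question: str) -> str:
    """Prepend a worked demonstration, then ask for step-by-step reasoning."""
    return f"{COT_DEMO}\nQ: {question}\nA: Let's think step by step."

p = cot_prompt("How much do 7 pens cost?")
print(p.endswith("Let's think step by step."))  # → True
```

Tree-of-thought methods generalize this by branching over multiple reasoning paths and scoring them, rather than committing to a single chain.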

Outline

  • Introduction
    • Background
      • Cost-effective approach in large language model enhancement
      • Distillation from GPT-4 and its significance
    • Objective
      • To improve mathematical reasoning in LLMs with a smaller, affordable model
      • Aim to reduce costs through data synthesis and efficient training
  • Method
    • Data Collection
      • Distillation from GPT-4
        • Gradient-based selection of math-related texts
        • Utilization of GPT-4's knowledge for synthesis
      • Problem Generation
        • 6 million math problems created with limited API calls
        • Tailored prompts for different education levels
    • Data Preprocessing
      • Synthesized dataset creation
      • Filtering and selection of relevant math content
      • Ensuring diversity and quality for model training
  • Model Training
    • JiuZhang3.0 Architecture
      • Description of the smaller, efficient model design
      • Comparison with larger models in terms of computational requirements
    • Training Process
      • Training methodology using the distilled dataset
      • Optimization for mathematical reasoning tasks
  • Evaluation
    • Performance Metrics
      • Natural language reasoning and tool manipulation tasks
      • Comparison with existing methods
    • Results
      • Outperformance on 18 evaluation datasets
      • Demonstrated improvement in reasoning and task handling
  • Cost Savings and Accessibility
    • Affordability of the JiuZhang3.0 approach
    • Public availability of code and data
  • Conclusion
    • Summary of the method's effectiveness and benefits
    • Potential implications for future LLM development and education applications
Basic info

Categories: computation and language, artificial intelligence
