MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions

Zhenwen Liang, Dian Yu, Wenhao Yu, Wenlin Yao, Zhihan Zhang, Xiangliang Zhang, Dong Yu · May 29, 2024

Summary

This paper introduces MathChat, a benchmark for evaluating large language models on multi-turn, interactive mathematical tasks. It addresses a gap in existing evaluations, which focus on single-turn question answering and overlook multi-round math reasoning, where models must combine dialogue understanding with complex problem-solving. MathChat comprises tasks such as follow-up QA, error correction, and problem generation, designed to assess problem-solving, reasoning, and instruction-following skills. The study finds that math-specialized models excel on the initial question but underperform on later turns that depend on longer dialogue context, while general-purpose models such as GPT-3.5 show promise in these interactive settings. The paper also introduces MathChatsync, a synthetic dataset built by sampling and generating math dialogues, and shows that fine-tuning on it improves multi-turn math capabilities. The results highlight the need for dialogue-oriented training and the potential of synthetic data for enhancing LLMs in math-related applications.
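
To make the multi-turn evaluation concrete, the sketch below shows one plausible way a follow-up QA item could be represented and scored: the full conversation history is re-sent to the model at every turn, and per-turn correctness is checked against a reference answer. The data classes, the exact-match metric, and the `dummy_model` stub are illustrative assumptions for this summary, not the paper's actual data format or evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    """One round of a multi-turn math dialogue: a question and its reference answer."""
    question: str
    reference_answer: str


@dataclass
class Dialogue:
    """A follow-up QA item: each turn builds on the answers to earlier turns."""
    turns: List[Turn]


def evaluate_dialogue(dialogue: Dialogue, model: Callable[[List[dict]], str]) -> float:
    """Feed the accumulated history to the model at every turn and return the
    fraction of turns answered correctly (simple substring match as a stand-in metric)."""
    history: List[dict] = []
    correct = 0
    for turn in dialogue.turns:
        history.append({"role": "user", "content": turn.question})
        prediction = model(history)
        history.append({"role": "assistant", "content": prediction})
        if turn.reference_answer.strip() in prediction:
            correct += 1
    return correct / len(dialogue.turns)


# A trivial stand-in "model" that always answers "6", just to keep the sketch runnable.
def dummy_model(history: List[dict]) -> str:
    return "The answer is 6."


if __name__ == "__main__":
    item = Dialogue(turns=[
        Turn("Tom has 2 boxes with 3 apples each. How many apples does he have?", "6"),
        Turn("If he gives away half of them, how many remain?", "3"),
    ])
    print(f"Turn-level accuracy: {evaluate_dialogue(item, dummy_model):.2f}")
```

Because the entire history is passed to the model at each turn, later turns implicitly test whether the model tracks earlier context, which is the kind of degradation the benchmark is designed to expose.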
