Strategic Scaling of Test-Time Compute: A Bandit Learning Approach
Bowen Zuo, Yinglun Zhu · June 15, 2025
Summary
A bandit learning method for strategically scaling test-time compute in large language models improves performance by 15.04% on MATH-500 and 14.40% on LiveCodeBench. Two related studies assess LLMs on reasoning tasks, with attention to few-shot fine-tuning and in-context learning: Pimentel et al. (2023) conduct a fair assessment, while Muennighoff et al. (2025) investigate simple test-time scaling. Models evaluated with the vLLM inference engine excel on math questions, and bandit algorithms such as entropy-based selection and UCB deliver the strongest results on hard datasets.
Introduction
Background
Overview of large language models (LLMs)
Importance of strategic scaling in LLMs
Objective
To explore the use of bandit learning for strategic scaling in LLMs
To demonstrate performance improvements on specific benchmarks
Method
Data Collection
Selection of benchmarks (MATH-500, LiveCodeBench)
Data characteristics and preparation
Data Preprocessing
Preprocessing steps for the benchmarks
Handling of data for bandit learning algorithms
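To make the preprocessing step concrete, here is a minimal sketch of how benchmark records might be normalized into per-query bandit rounds. The `BanditQuery` structure and the `problem`/`answer` field names are illustrative assumptions, not the paper's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class BanditQuery:
    """One benchmark problem, framed as a single bandit round."""
    prompt: str      # question text sent to the model
    reference: str   # gold answer used for reward computation
    domain: str      # e.g. "math" or "code"

def prepare_queries(records, domain):
    """Normalize raw benchmark records into bandit rounds.

    `records` is assumed to be an iterable of dicts with "problem"
    and "answer" fields; the real benchmark schemas may differ.
    """
    return [
        BanditQuery(
            prompt=rec["problem"].strip(),
            reference=str(rec["answer"]).strip(),
            domain=domain,
        )
        for rec in records
    ]

# Toy usage: wrap a MATH-500-style record so the downstream bandit
# loop can treat math and code data uniformly.
math_rounds = prepare_queries(
    [{"problem": "Compute 2 + 2.", "answer": "4"}], domain="math"
)
```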
Bandit Learning Application
Explanation of bandit learning principles
Integration of bandit learning for strategic scaling
Implementation details and algorithm selection (Entropy, UCB)
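As a rough illustration of the bandit principle behind the method, the sketch below runs a generic UCB1 loop over a few candidate test-time-compute strategies. The arm set, the `evaluate` callback, and the reward definition (1.0 for a correct answer) are assumptions for illustration; the paper's actual formulation, including its entropy-based variant, may differ.

```python
import math
import random

# Candidate "arms": test-time compute strategies to allocate per query.
# These particular strategies are illustrative, not the paper's exact arm set.
ARMS = ["greedy_1_sample", "self_consistency_8", "self_consistency_32"]

def ucb1_select(counts, rewards, t, c=2.0):
    """Pick the arm maximizing empirical mean reward plus an exploration bonus."""
    for arm, n in counts.items():
        if n == 0:  # play every arm once before relying on the bonus
            return arm
    return max(
        counts,
        key=lambda a: rewards[a] / counts[a]
                      + math.sqrt(c * math.log(t) / counts[a]),
    )

def run_bandit(queries, evaluate):
    """Generic UCB1 loop.

    `evaluate(query, arm)` is assumed to run the chosen strategy and
    return a scalar reward (e.g. 1.0 if the final answer is correct).
    """
    counts = {a: 0 for a in ARMS}
    rewards = {a: 0.0 for a in ARMS}
    for t, query in enumerate(queries, start=1):
        arm = ucb1_select(counts, rewards, t)
        counts[arm] += 1
        rewards[arm] += evaluate(query, arm)
    return counts, rewards

# Toy usage with a random stand-in for an actual LLM evaluation.
counts, rewards = run_bandit(
    queries=list(range(100)),
    evaluate=lambda q, a: float(random.random() < {
        "greedy_1_sample": 0.40,
        "self_consistency_8": 0.55,
        "self_consistency_32": 0.60,
    }[a]),
)
```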
Performance Evaluation
Benchmark Results
Detailed results on MATH-500 and LiveCodeBench
Performance metrics and comparisons
Comparative Analysis
Comparison with other LLM scaling methods
Highlighting the 15.04% and 14.40% improvements
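For reference, the arithmetic behind a relative-improvement figure such as 15.04% can be sketched as follows. The baseline accuracy used here is hypothetical, and this outline does not establish whether the reported gains are relative or absolute; the snippet only illustrates the relative reading.

```python
def relative_gain(baseline_acc, method_acc):
    """Percentage improvement of the method over a baseline accuracy."""
    return 100.0 * (method_acc - baseline_acc) / baseline_acc

# Hypothetical numbers: a 50.0% baseline that a 15.04% relative gain
# on MATH-500 would lift to roughly 57.5%.
print(relative_gain(50.0, 50.0 * 1.1504))  # ~15.04
```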
Reasoning Task Studies
Pimentel et al. (2023)
Overview of the study
Fair assessment methodology
Findings and implications
Muennighoff et al. (2025)
Focus on simple test-time scaling
Methodology and results
Contribution to the field of reasoning tasks
Model Performance
vLLM-Served Models
Role of the vLLM inference engine in evaluation
Strength on math questions
Performance on hard datasets
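vLLM is an open-source inference engine rather than a model family; the sketch below shows a minimal way to sample multiple candidate solutions through it. The model name and sampling settings are placeholders, not necessarily those used in the paper.

```python
from vllm import LLM, SamplingParams

# Example math-tuned model; any chat/instruct checkpoint supported by vLLM works.
llm = LLM(model="Qwen/Qwen2.5-Math-7B-Instruct")
params = SamplingParams(n=8, temperature=0.7, max_tokens=1024)

prompts = ["Solve: what is the remainder when 7^100 is divided by 5?"]
outputs = llm.generate(prompts, params)

# Each request returns `n` candidate completions; a selection rule
# (e.g. majority vote or an entropy/UCB-guided policy) picks the final answer.
for out in outputs:
    candidates = [c.text for c in out.outputs]
```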
Conclusion
Summary of Findings
Recap of the method's effectiveness
Key performance indicators and benchmarks
Future Directions
Potential areas for further research
Recommendations for practical applications
Basic info
Categories: Computation and Language, Machine Learning, Artificial Intelligence
Insights
How does the bandit learning method strategically scale large language models to achieve performance improvements?
What performance gains were observed on the MATH-500 and LiveCodeBench datasets when using bandit learning for scaling?
How do models served with vLLM perform on math questions, and which bandit algorithms (such as entropy-based selection and UCB) contribute to their success on hard datasets?
What are the key differences in the assessment methodologies used by Pimentel et al. (2023) and Muennighoff et al. (2025) when evaluating LLMs for reasoning tasks?