Related papers: Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

URL: http://arxiv.org/abs/2502.06703v1
Date: Mon, 10 Feb 2025 17:30:23 GMT
Title: Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
Authors: Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou
Abstract summary: Test-Time Scaling is an important method for improving the performance of Large Language Models. This paper focuses on two core questions: What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? We show that with our compute-optimal TTS strategy, extremely small policy models can outperform larger models.
Score: 69.57918638435491
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and challenging AIME24 tasks, we have the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, with higher inference efficiency. These findings show the significance of adapting TTS strategies to the specific characteristics of each task and model and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.
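Illustration: the abstract refers to scaling test-time computation with a policy model and a PRM. A minimal sketch of three common ways to pick a final answer from N sampled candidates is given below. It is not the paper's compute-optimal strategy; the function names, the sample answers, and the scores are hypothetical, and the scores stand in for whatever a PRM would assign to each candidate.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Return the most frequent final answer among the sampled candidates."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, scores):
    """Return the answer of the single highest-scored candidate."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]

def weighted_vote(answers, scores):
    """Sum the scores of candidates sharing an answer; return the top total."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)

# Hypothetical candidates: final answers from four samples of a policy model,
# with made-up reward scores standing in for PRM outputs.
answers = ["42", "41", "42", "17"]
scores = [0.5, 0.9, 0.6, 0.1]

print(majority_vote(answers))          # most frequent answer
print(best_of_n(answers, scores))      # answer with the highest single score
print(weighted_vote(answers, scores))  # answer with the highest summed score
```

The three selectors can disagree on the same candidates. This is one simple way to see why the choice of aggregation, and of the reward model behind the scores, can matter.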

Related papers

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems. We build a benchmark consisting of 1,260 samples of 42 challenging synthetic tasks. We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z)
How to Train Your LLM Web Agent: A Statistical Diagnosis [102.04125085041473]
We present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT) and on-policy reinforcement learning. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++.
arXiv Detail & Related papers (2025-07-05T17:12:33Z)
Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL [62.984693936073974]
Large language models (LLMs) excel in tasks like question answering and dialogue. Complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. We propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents.
arXiv Detail & Related papers (2025-05-23T16:51:54Z)
How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study [16.441081996257576]
This paper presents a rigorous experimental investigation into how difficulty-aware staged reinforcement learning strategies can substantially improve reasoning performance. We show that strategically selecting training data according to well-defined difficulty levels markedly enhances RL optimization. We will open-source our datasets on GitHub and Hugging Face.
arXiv Detail & Related papers (2025-04-01T14:18:38Z)
Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute [55.330813919992465]
This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute. Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths.
arXiv Detail & Related papers (2025-04-01T13:13:43Z)
Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving [55.895917967408586]
Existing approaches to mathematical reasoning with large language models rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation. We propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework that enables LLMs to personalize their reasoning strategy spontaneously.
arXiv Detail & Related papers (2025-02-17T16:56:23Z)
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. We present T1 to scale reinforcement learning by encouraging exploration and to understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B [11.832907585157638]
This paper evaluates the performance of instruction-tuned LLMs on models ranging from 7B to 405B. We assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue.
arXiv Detail & Related papers (2024-09-17T10:31:37Z)
Online Cascade Learning for Efficient Inference over Streams [9.516197133796437]
Large Language Models (LLMs) have a natural role in answering complex queries about data streams. We propose online cascade learning, the first approach to address this challenge. We formulate the task of learning cascades online as an imitation-learning problem.
arXiv Detail & Related papers (2024-02-07T01:46:50Z)
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models. It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the amazing power of language models (LLMs) to solve our task. We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z)
Scaling Sentence Embeddings with Large Language Models [43.19994568210206]
In this work, we propose an in-context learning-based method aimed at improving sentence embedding performance. Our approach involves adapting the previous prompt-based representation method for autoregressive models. By scaling model size, we find that scaling to more than tens of billions of parameters harms performance on semantic textual similarity tasks.
arXiv Detail & Related papers (2023-07-31T13:26:03Z)
TART: A plug-and-play Transformer module for task-agnostic reasoning [38.84903599406189]
Large language models (LLMs) exhibit in-context learning abilities which enable the same model to perform several tasks without any task-specific training. Traditional adaptation approaches, such as fine-tuning, modify the underlying models for each specific task. We propose TART which generically improves an LLM's reasoning abilities using a synthetically trained Transformer-based reasoning module.
arXiv Detail & Related papers (2023-06-13T04:37:00Z)
SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs). We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer. We evaluate SatLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
arXiv Detail & Related papers (2023-05-16T17:55:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.