Related papers: LIMO: Less is More for Reasoning

LIMO: Less is More for Reasoning

URL: http://arxiv.org/abs/2502.03387v3
Date: Tue, 29 Jul 2025 16:23:02 GMT
Title: LIMO: Less is More for Reasoning
Authors: Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu,
Abstract summary: We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples.<n>Our model, LIMO, achieves 63.3% accuracy on AIME24 and 95.6% on MATH500.<n>LIMO exhibits strong out-of-distribution generalization, achieving a 45.8% absolute improvement across diverse benchmarks.
Score: 23.312893016642096
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We challenge the prevailing assumption that complex reasoning in large language models (LLMs) necessitates massive training data. We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples. Specifically, through simple supervised fine-tuning, our model, LIMO, achieves 63.3\% accuracy on AIME24 and 95.6\% on MATH500, surpassing previous fine-tuned models (6.5\% on AIME24, 59.2\% on MATH500) while using only 1\% of the training data required by prior approaches. Furthermore, LIMO exhibits strong out-of-distribution generalization, achieving a 45.8\% absolute improvement across diverse benchmarks, outperforming models trained on 100x more data. Synthesizing these findings, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes. This hypothesis suggests that the threshold for eliciting complex reasoning is not dictated by task complexity but rather by two key factors: (1) the completeness of the model's pre-trained knowledge base and (2) the effectiveness of post-training examples in serving as "cognitive templates" that guide reasoning.

Related papers

Scaling Reasoning can Improve Factuality in Large Language Models [7.184302333801519]
We thoroughly examine large language model (LLM) reasoning within complex open-domain question-answering (QA) scenarios.<n>To enrich reasoning traces, we introduce factual information from knowledge graphs in the form of paths into our reasoning traces.<n>Our findings indicate that, within a single run, smaller reasoning models achieve noticeable improvements in factual accuracy compared to their original instruction-tuned counterparts.
arXiv Detail & Related papers (2025-05-16T11:39:33Z)
Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset.<n>We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard)<n>We find that progression from Easy to Medium tier requires adopting an R1 reasoning style with minimal SFT-1K instances.<n>Exh-level questions present a fundamentally different challenge; they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z)
How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study [16.441081996257576]
This paper presents a rigorous experimental investigation into how difficulty-aware staged reinforcement learning strategies can substantially improve reasoning performance. We show that strategically selecting training data according to well-defined difficulty levels markedly enhances RL optimization. We will open-source our datasets on GitHub and Hugging Face.
arXiv Detail & Related papers (2025-04-01T14:18:38Z)
Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad [0.8030359871216614]
We evaluate state-of-the-art reasoning models on six problems from the 2025 USAMO.<n>Only Gemini-2.5-Pro achieves a non-trivial score of 25%.<n>Our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks.
arXiv Detail & Related papers (2025-03-27T19:21:05Z)
The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency. UPFT removes the need for labeled data or exhaustive sampling. Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
Graph-Augmented Reasoning: Evolving Step-by-Step Knowledge Graph Retrieval for LLM Reasoning [55.6623318085391]
Recent large language model (LLM) reasoning suffers from limited domain knowledge, susceptibility to hallucinations, and constrained reasoning depth. This paper presents the first investigation into integrating step-wise knowledge graph retrieval with step-wise reasoning. We propose KG-RAR, a framework centered on process-oriented knowledge graph construction, a hierarchical retrieval strategy, and a universal post-retrieval processing and reward model.
arXiv Detail & Related papers (2025-03-03T15:20:41Z)
LIMR: Less is More for RL Scaling [25.477841726836836]
We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples. Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523 samples dataset. For reproducible research and future innovation, we are open-sourcing LIMR, including implementation of LIM, training and evaluation code, curated datasets, and trained models.
arXiv Detail & Related papers (2025-02-17T15:13:29Z)
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through textbfOutcome textbfREwtextbfArd-based reinforcement textbfLearning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z)
Self-supervised Analogical Learning using Language Models [59.64260218737556]
We propose SAL, a self-supervised analogical learning framework.<n> SAL mimics the human analogy process and trains models to explicitly transfer high-quality symbolic solutions.<n>We show that the resulting models outperform base language models on a wide range of reasoning benchmarks.
arXiv Detail & Related papers (2025-02-03T02:31:26Z)
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.<n>Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling.<n>We present T1 to scale reinforcement learning by encouraging exploration and understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
Patience Is The Key to Large Language Model Reasoning [0.0]
We propose a simple method by encouraging models to adopt a more patient reasoning style.<n>We generate detailed reasoning processes as positive examples and simple answers as negative examples, thereby training the model to favor thoroughness in its responses.<n>Our results demonstrate a performance increase of up to 2.1% on GSM8k with training just on a lightweight dataset.
arXiv Detail & Related papers (2024-11-20T07:20:48Z)
What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy. By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
Common 7B Language Models Already Possess Strong Math Capabilities [61.61442513067561]
This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities. The potential for extensive scaling is constrained by the scarcity of publicly available math questions.
arXiv Detail & Related papers (2024-03-07T18:00:40Z)
Enhancing Numerical Reasoning with the Guidance of Reliable Reasoning Processes [55.2326738851157]
We introduce Enhancing NumeriCal reasOning with Reliable procEsses (Encore), which derives the reliable reasoning process by decomposing the answer formula. We present a series of pre-training tasks to help models learn the reasoning process generation with synthesized data. Experiments show that Encore yields improvement on all five experimental datasets with an average of 1.8%.
arXiv Detail & Related papers (2024-02-16T13:02:11Z)
Learn to Accumulate Evidence from All Training Samples: Theory and Practice [7.257751371276488]
Evidential deep learning offers a principled and computationally efficient way to turn a deterministic neural network uncertainty-aware. Existing evidential activation functions create zero evidence regions, which prevent the model to learn from training samples falling into such regions. A deeper analysis of evidential activation functions based on our theoretical underpinning inspires the design of a novel regularizer.
arXiv Detail & Related papers (2023-06-19T18:27:12Z)
LIMA: Less Is More for Alignment [112.93890201395477]
We train LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases.
arXiv Detail & Related papers (2023-05-18T17:45:22Z)
Logic-Guided Data Augmentation and Regularization for Consistent Question Answering [55.05667583529711]
This paper addresses the problem of improving the accuracy and consistency of responses to comparison questions. Our method leverages logical and linguistic knowledge to augment labeled training data and then uses a consistency-based regularizer to train the model.
arXiv Detail & Related papers (2020-04-21T17:03:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.