To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning
- URL: http://arxiv.org/abs/2504.07052v1
- Date: Wed, 09 Apr 2025 17:12:49 GMT
- Title: To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning
- Authors: Tian Qin, David Alvarez-Melis, Samy Jelassi, Eran Malach
- Abstract summary: Backtracking naturally scales test-time compute by enabling sequential, linearized exploration via long chain-of-thought (CoT) generation. Despite the growing adoption of sequential search, its advantages over parallel sampling remain poorly understood. We show that models with backtracking capabilities benefit significantly from RL fine-tuning, while models without backtracking see limited, mixed gains.
- Score: 31.21491548356213
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in large language models have significantly improved their reasoning abilities, particularly through techniques involving search and backtracking. Backtracking naturally scales test-time compute by enabling sequential, linearized exploration via long chain-of-thought (CoT) generation. However, this is not the only strategy for scaling test-time compute: parallel sampling with best-of-n selection provides an alternative that generates diverse solutions simultaneously. Despite the growing adoption of sequential search, its advantages over parallel sampling, especially under a fixed compute budget, remain poorly understood. In this paper, we systematically compare these two approaches on two challenging reasoning tasks: CountDown and Sudoku. Surprisingly, we find that sequential search underperforms parallel sampling on CountDown but outperforms it on Sudoku, suggesting that backtracking is not universally beneficial. We identify two factors that can cause backtracking to degrade performance: (1) training on fixed search traces can lock models into suboptimal strategies, and (2) explicit CoT supervision can discourage "implicit" (non-verbalized) reasoning. Extending our analysis to reinforcement learning (RL), we show that models with backtracking capabilities benefit significantly from RL fine-tuning, while models without backtracking see limited, mixed gains. Together, these findings challenge the assumption that backtracking universally enhances LLM reasoning, instead revealing a complex interaction between task structure, training data, model scale, and learning paradigm.
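As a rough illustration of the trade-off studied here, the sketch below spends the same token budget in the two ways the abstract contrasts: one long sequential (backtracking-style) rollout versus n shorter parallel rollouts with best-of-n selection. `sample_chain` and `score_solution` are hypothetical stand-ins for an LLM decoder and a verifier, not the authors' implementation.

```python
# Minimal sketch: two ways to spend a fixed test-time token budget.
import random


def sample_chain(prompt: str, max_tokens: int, seed: int) -> str:
    """Hypothetical placeholder for one (possibly backtracking) CoT rollout."""
    return f"candidate {seed} for {prompt!r} using up to {max_tokens} tokens"


def score_solution(solution: str) -> float:
    """Hypothetical placeholder verifier used for best-of-n selection."""
    return random.random()


def sequential_search(prompt: str, budget_tokens: int) -> str:
    # Spend the entire budget on a single long, linearized search trace
    # in which the model may backtrack within its own chain of thought.
    return sample_chain(prompt, max_tokens=budget_tokens, seed=0)


def parallel_best_of_n(prompt: str, budget_tokens: int, n: int) -> str:
    # Split the same budget across n independent, shorter rollouts and
    # keep the candidate the verifier scores highest.
    per_sample = budget_tokens // n
    candidates = [sample_chain(prompt, per_sample, seed=i) for i in range(n)]
    return max(candidates, key=score_solution)


if __name__ == "__main__":
    budget = 4096  # illustrative fixed compute budget, in tokens
    task = "CountDown: reach 24 from [3, 5, 7, 9]"
    print(sequential_search(task, budget))
    print(parallel_best_of_n(task, budget, n=8))
```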
Related papers
- Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods [39.89239733570008]
This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models.
We find that non-reasoning models, even with an extremely high inference budget, still fall substantially behind reasoning models.
For reasoning models, majority voting proves to be a robust inference strategy, generally competitive with or outperforming other, more sophisticated ITC methods.
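For concreteness, a minimal sketch of majority voting (self-consistency) over sampled final answers; `sample_answer` is a hypothetical stand-in for a model call, not code from the paper.

```python
# Minimal sketch of majority voting: sample n answers, return the most common one.
import random
from collections import Counter
from typing import Callable, List


def majority_vote(sample_answer: Callable[[str], str], prompt: str, n: int = 16) -> str:
    answers: List[str] = [sample_answer(prompt) for _ in range(n)]
    # Counter.most_common breaks ties by first occurrence.
    return Counter(answers).most_common(1)[0][0]


# Illustrative usage with a toy sampler standing in for an LLM call.
print(majority_vote(lambda _p: random.choice(["42", "42", "41"]), "What is 6 x 7?"))
```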
arXiv Detail & Related papers (2025-04-18T19:32:55Z)
- Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks. We explore whether scaling with longer CoTs can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z)
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [61.85289698610747]
We study whether o1-like large language models (LLMs) truly possess test-time scaling capabilities. We find that longer CoTs of these o1-like models do not consistently enhance accuracy. We propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics.
arXiv Detail & Related papers (2025-02-17T07:21:11Z)
- Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation [40.861314212279474]
We study inference-time compute by viewing chain-of-thought (CoT) generation as a metastable Markov process. We prove that implementing a search protocol that rewards sparse edges improves CoT by decreasing the expected number of steps to reach different clusters. We also show that the information gained by search can be utilized to obtain a better reasoning model.
arXiv Detail & Related papers (2025-02-02T18:19:14Z)
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. We present T1 to scale reinforcement learning by encouraging exploration and to understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
- AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability [29.1826948551409]
AQA-Bench is a novel benchmark to assess the sequential reasoning capabilities of large language models.
We build AQA-Bench with three different algorithms, namely binary search, depth-first search, and breadth-first search.
Our investigations reveal several interesting findings.
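As a rough, assumption-level illustration (the actual AQA-Bench protocol may differ), the toy loop below plays a number-guessing game in which binary search is the optimal sequential strategy; `ask_oracle` is a hypothetical stand-in for the interactive environment's feedback.

```python
# Toy interactive guessing game where binary search is the optimal policy.
def ask_oracle(secret: int, guess: int) -> str:
    """Hypothetical stand-in for the environment's higher/lower/correct feedback."""
    if guess == secret:
        return "correct"
    return "higher" if secret > guess else "lower"


def binary_search_agent(low: int, high: int, secret: int) -> int:
    """Return the number of queries needed to locate the secret."""
    steps = 0
    while low <= high:
        guess = (low + high) // 2
        steps += 1
        feedback = ask_oracle(secret, guess)
        if feedback == "correct":
            return steps
        if feedback == "higher":
            low = guess + 1
        else:
            high = guess - 1
    raise ValueError("secret lies outside the search range")


# Prints the number of queries needed (logarithmic in the range size).
print(binary_search_agent(0, 1023, secret=777))
```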
arXiv Detail & Related papers (2024-02-14T18:59:33Z)
- Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
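A minimal sketch of the advantage computation described above, under the assumption that a Q-network scores (state, item) pairs: the advantage of the observed positive item is its Q-value minus the mean Q-value of the sampled negative items. Names and values are illustrative, not the SA2C implementation.

```python
# Sketch: advantage of the positive action relative to the sampled "average case".
from typing import Sequence


def advantage(q_positive: float, q_negatives: Sequence[float]) -> float:
    baseline = sum(q_negatives) / len(q_negatives)  # mean Q over sampled negative items
    return q_positive - baseline


# Toy usage: the positive item scores 2.0 vs. negatives averaging 0.5, giving
# advantage 1.5, which would then weight the actor's policy-gradient update.
print(advantage(2.0, [1.0, 0.5, 0.0]))
```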
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
- Cascaded Regression Tracking: Towards Online Hard Distractor Discrimination [202.2562153608092]
We propose a cascaded regression tracker with two sequential stages.
In the first stage, we filter out the abundant, easily identified negative candidates.
In the second stage, a discrete sampling based ridge regression is designed to double-check the remaining ambiguous hard samples.
arXiv Detail & Related papers (2020-06-18T07:48:01Z)