Learning to Better Search with Language Models via Guided Reinforced Self-Training
- URL: http://arxiv.org/abs/2410.02992v2
- Date: Mon, 27 Oct 2025 04:46:45 GMT
- Title: Learning to Better Search with Language Models via Guided Reinforced Self-Training
- Authors: Seungyong Moon, Bumsoo Park, Hyun Oh Song
- Abstract summary: We propose guided reinforced self-training (Guided-ReST) to improve the model's capability for effective search during inference.
Guided-ReST incorporates optimal solutions into the model's search procedure, enabling the generation of high-quality search traces.
Our method significantly enhances the search capabilities of language models on arithmetic reasoning and code self-repair tasks.
- Score: 15.289058352618468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While language models have shown remarkable performance across diverse tasks, they still encounter challenges in complex reasoning scenarios. Recent research suggests that language models trained on linearized search traces toward solutions, rather than solely on the final solutions, exhibit improved generalization, despite the search traces being potentially noisy or suboptimal. However, relying on such imperfect traces can result in inefficient use of test-time compute. To address this, we propose guided reinforced self-training (Guided-ReST), a fine-tuning algorithm designed to improve the model's capability for effective search during inference. The key insight behind Guided-ReST is that optimal solutions can serve as valuable step-by-step landmarks to guide the model's search process. Based on this insight, we introduce a novel data generation method that seamlessly incorporates optimal solutions into the model's search procedure, enabling the generation of high-quality search traces. By fine-tuning the model on these search traces, we effectively distill improved search strategies into the model. Our method significantly enhances the search capabilities of language models on arithmetic reasoning and code self-repair tasks, including Countdown, CodeContests, and CodeForces. We release the source code at https://github.com/snu-mllab/guided-rest.
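To make the landmark idea concrete, here is a minimal, self-contained sketch of the guided trace-generation loop, not the released implementation; `model_step`, `is_progress`, and the toy usage are hypothetical stand-ins for the model's search proposals and a progress check.

```python
# A minimal sketch of guided trace generation in the spirit of Guided-ReST:
# let the model search on its own, and splice in the next step of the known
# optimal solution as a landmark whenever the search stalls.

import random

def generate_guided_trace(optimal_steps, model_step, is_progress, max_tries=3):
    """Interleave model-proposed search steps with optimal-solution landmarks."""
    trace, state = [], 0  # `state` indexes how far along the solution we are
    while state < len(optimal_steps):
        for _ in range(max_tries):
            proposal = model_step(state)
            trace.append(("model", proposal))
            if is_progress(proposal, optimal_steps[state]):
                state += 1
                break
        else:
            # Search stalled: splice in the next optimal step as a landmark.
            trace.append(("landmark", optimal_steps[state]))
            state += 1
    return trace  # linearized search trace used for fine-tuning

# Toy usage: a "model" that only sometimes matches the optimal step.
optimal = ["a", "b", "c"]
trace = generate_guided_trace(
    optimal,
    model_step=lambda s: random.choice(["a", "b", "c", "x"]),
    is_progress=lambda p, opt: p == opt,
)
print(trace)
```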
Related papers
- ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context [66.15505423059234]
We introduce ASTRO, a framework for training language models to reason like search algorithms.
We apply ASTRO to the Llama 3 family of models and achieve absolute performance gains of 16.4% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024.
arXiv Detail & Related papers (2025-07-01T04:10:15Z)
- Preference Optimization for Combinatorial Optimization Problems [54.87466279363487]
Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn to solve complex problems without requiring expert knowledge.
Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast action spaces.
We propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling.
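One common way to realize reward-to-preference conversion is a Bradley-Terry-style pairwise loss; the sketch below illustrates that general recipe under assumed inputs, not the paper's exact comparison model.

```python
# A minimal sketch of turning quantitative rewards into pairwise preference
# signals: of two sampled solutions, the higher-reward one is treated as
# preferred, and only the comparison (not the reward magnitude) trains the
# policy via a logistic (Bradley-Terry-style) loss.

import torch
import torch.nn.functional as F

def pairwise_preference_loss(logp_a, logp_b, reward_a, reward_b):
    """logp_*: summed log-probabilities of two sampled solutions under the
    policy; reward_*: their scalar rewards."""
    preferred_minus_rejected = torch.where(
        reward_a > reward_b, logp_a - logp_b, logp_b - logp_a
    )
    return -F.logsigmoid(preferred_minus_rejected).mean()

# Toy usage with fake log-probs and rewards for a batch of 4 pairs.
logp_a = torch.randn(4, requires_grad=True)
logp_b = torch.randn(4)
loss = pairwise_preference_loss(
    logp_a, logp_b,
    torch.tensor([3.0, 1.0, 2.0, 5.0]),
    torch.tensor([2.0, 4.0, 2.5, 1.0]),
)
loss.backward()
print(float(loss))
```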
arXiv Detail & Related papers (2025-05-13T16:47:00Z)
- Offline Learning and Forgetting for Reasoning with Large Language Models [23.384882158333156]
We propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful and failed reasoning paths.
Experiments on the challenging Game-of-24 and Countdown reasoning benchmarks show that replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines.
Our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.
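A learn-and-forget objective on unpaired paths is commonly built from likelihood on successes plus an unlikelihood penalty on failures; the sketch below shows that generic construction (the paper's exact objective may differ).

```python
# A minimal sketch of a learn-and-forget objective: maximize the likelihood
# of tokens on successful reasoning paths while applying an unlikelihood
# penalty -log(1 - p) to tokens on failed ones.

import torch
import torch.nn.functional as F

def learn_and_forget_loss(logits_succ, tokens_succ, logits_fail, tokens_fail,
                          forget_coef=0.1):
    # Standard next-token cross-entropy on successful paths ("learning").
    learn = F.cross_entropy(logits_succ.flatten(0, 1), tokens_succ.flatten())
    # Unlikelihood on failed paths ("forgetting"): push token probs down.
    logp_fail = torch.log_softmax(logits_fail, dim=-1)
    p_tok = logp_fail.gather(-1, tokens_fail.unsqueeze(-1)).squeeze(-1).exp()
    forget = -torch.log1p(-p_tok.clamp(max=1 - 1e-6)).mean()
    return learn + forget_coef * forget

# Toy usage: batch of 2 sequences, length 5, vocab 11.
ls = torch.randn(2, 5, 11, requires_grad=True)
lf = torch.randn(2, 5, 11)
ts = torch.randint(0, 11, (2, 5))
tf = torch.randint(0, 11, (2, 5))
print(float(learn_and_forget_loss(ls, ts, lf, tf)))
```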
arXiv Detail & Related papers (2025-04-15T16:30:02Z)
- Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
Generative retrieval reformulates retrieval as an autoregressive generation task, where large language models generate target documents directly from a query.
We systematically investigate training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence performance.
arXiv Detail & Related papers (2025-03-24T17:59:03Z)
- EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models [64.18350535770357]
We propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning.
Our approach only leverages a small number of samples to search for the desired pruning policy.
We conduct extensive experiments on the ScienceQA, VizWiz, MM-Vet, and LLaVA-Bench datasets for the task of visual question answering.
arXiv Detail & Related papers (2025-03-19T16:07:04Z)
- World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning [60.100794160682646]
We propose a new learning framework that jointly optimizes state prediction and action selection through preference learning.
To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial-and-error.
Our method significantly outperforms existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B).
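The sketch below illustrates one way such trial-and-error preference collection can look: expand several candidate actions per step, score each by a rollout, and record best-vs-worst pairs. `expand` and `rollout_value` are hypothetical stubs, not the paper's components.

```python
# A minimal sketch of collecting stepwise preference pairs by trial-and-error
# tree search: at each depth, expand k candidates, score them by rollout, and
# keep (chosen, rejected) pairs for preference learning.

import random

def collect_step_preferences(root_state, expand, rollout_value, depth=3, k=4):
    prefs, state = [], root_state
    for _ in range(depth):
        candidates = [expand(state) for _ in range(k)]
        scored = sorted(candidates, key=rollout_value, reverse=True)
        best, worst = scored[0], scored[-1]
        prefs.append({"state": state, "chosen": best, "rejected": worst})
        state = best  # continue the search from the best child
    return prefs

# Toy usage on an integer "state" whose rollout value is the state itself.
pairs = collect_step_preferences(
    0, expand=lambda s: s + random.randint(-2, 5), rollout_value=lambda s: s
)
print(pairs)
```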
arXiv Detail & Related papers (2025-03-13T15:49:56Z)
- Automatic Prompt Optimization via Heuristic Search: A Survey [13.332569343755075]
Large Language Models have led to remarkable achievements across a variety of Natural Language Processing tasks.
While manual methods can be effective, they typically rely on intuition and do not automatically refine prompts over time.
In contrast, automatic prompt optimization employing heuristic-based search algorithms can systematically explore and improve prompts with minimal human oversight.
arXiv Detail & Related papers (2025-02-26T01:42:08Z)
- Autoformulation of Mathematical Optimization Models Using LLMs [50.030647274271516]
This paper approaches the problem of autoformulation: the automated creation of solver-ready optimization models from natural language problem descriptions.
We identify core challenges of autoformulation, including (1) the vast, problem-dependent hypothesis space, and (2) efficient and diverse exploration of this space under uncertainty.
We present a novel method leveraging Large Language Models with Monte-Carlo Tree Search, exploiting the hierarchical nature of optimization modeling to generate and systematically explore possible formulations.
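As one node-selection step of such a tree search, a UCB-style rule balances formulation quality against exploration; the sketch below shows that standard rule, with the LLM proposal and evaluation calls left as hypothetical stubs.

```python
# A minimal sketch of UCB-style selection among candidate formulations, one
# building block of a Monte-Carlo Tree Search over optimization models.

import math

def ucb_select(children, total_visits, c=1.4):
    """Pick the child formulation balancing value and exploration."""
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")  # always try unvisited formulations first
        exploit = ch["value"] / ch["visits"]
        explore = c * math.sqrt(math.log(total_visits) / ch["visits"])
        return exploit + explore
    return max(children, key=score)

# Toy usage: three partial formulations with running value/visit stats.
children = [
    {"name": "linear objective", "value": 2.0, "visits": 4},
    {"name": "quadratic objective", "value": 1.0, "visits": 1},
    {"name": "integer variables", "value": 0.0, "visits": 0},
]
print(ucb_select(children, total_visits=5)["name"])
```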
arXiv Detail & Related papers (2024-11-03T20:41:38Z)
- In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality.
To handle these challenges, a direct solution is to generate "high-confidence" data from unsupervised downstream tasks.
We propose a novel approach, the pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z)
- Learning Joint Models of Prediction and Optimization [56.04498536842065]
The Predict-Then-Optimize framework uses machine learning models to predict unknown parameters of an optimization problem from features before solving.
This paper proposes an alternative method, in which optimal solutions are learned directly from the observable features by joint predictive models.
arXiv Detail & Related papers (2024-09-07T19:52:14Z)
- QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tunes a small pretrained language model to generate optimal prompts tailored to the input queries.
We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks.
Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2024-08-20T03:06:48Z)
- Discovering Preference Optimization Algorithms with and for Large Language Models [50.843710797024805]
Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs.
We perform objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention.
Experiments demonstrate the state-of-the-art performance of DiscoPOP, a novel algorithm that adaptively blends logistic and exponential losses.
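The exact discovered loss is given in the paper; the sketch below only illustrates, under assumed forms, what adaptively blending a logistic (DPO-style) and an exponential preference loss with a sigmoid gate on the margin could look like.

```python
# A hedged sketch of blending a logistic loss with an exponential loss via a
# sigmoid gate on the log-ratio margin. Illustrative only; consult the paper
# for DiscoPOP's exact functional form.

import torch
import torch.nn.functional as F

def blended_preference_loss(margin, beta=0.05, tau=1.0):
    """margin: log pi/ref ratio of the chosen minus rejected response."""
    logistic = -F.logsigmoid(beta * margin)   # DPO-style sigmoid loss
    exponential = torch.exp(-beta * margin)   # exponential preference loss
    gate = torch.sigmoid(margin / tau)        # mixing weight from the margin
    return (gate * logistic + (1 - gate) * exponential).mean()

# Toy usage on a batch of margins.
margin = torch.randn(8, requires_grad=True)
loss = blended_preference_loss(margin)
loss.backward()
print(float(loss))
```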
arXiv Detail & Related papers (2024-06-12T16:58:41Z)
- Beyond Training: Optimizing Reinforcement Learning Based Job Shop Scheduling Through Adaptive Action Sampling [10.931466852026663]
We investigate the optimal use of trained deep reinforcement learning (DRL) agents during inference.
Our work is based on the hypothesis that, similar to search algorithms, the utilization of trained DRL agents should be dependent on the acceptable computational budget.
We propose an algorithm for obtaining the optimal parameterization for a given number of solutions and any given trained agent.
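The sketch below shows only the sampling side of budget-dependent inference: spend the allowed number of rollouts with a tunable softmax temperature and keep the best solution. The paper's parameterization search is more involved, and the rollout here is a toy stand-in.

```python
# A minimal sketch of budget-dependent inference for a trained agent.

import math
import random

def rollout(policy_logits_fn, temperature, steps=5):
    """One sampled schedule: draw actions from tempered logits at each step."""
    schedule = []
    for t in range(steps):
        logits = policy_logits_fn(t)
        weights = [math.exp(l / temperature) for l in logits]
        schedule.append(random.choices(range(len(logits)), weights=weights)[0])
    return schedule, sum(schedule)  # toy cost: lower is better

def sample_solutions(policy_logits_fn, budget, temperature):
    """Draw `budget` rollouts and keep the best one."""
    best, best_cost = None, float("inf")
    for _ in range(budget):
        schedule, cost = rollout(policy_logits_fn, temperature)
        if cost < best_cost:
            best, best_cost = schedule, cost
    return best, best_cost

print(sample_solutions(lambda t: [0.1, 0.5, -0.2], budget=16, temperature=0.7))
```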
arXiv Detail & Related papers (2024-06-11T14:59:18Z)
- Stream of Search (SoS): Learning to Search in Language [29.841835308845948]
We show how language models can be taught to search by representing the process of search in language as a flattened string.
We propose a unified language for search that captures an array of different symbolic search strategies.
Our results indicate that language models can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
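The sketch below shows one way a search process can be flattened into a single string, in the spirit of Stream of Search; the trace vocabulary here is invented and differs from the paper's unified search language.

```python
# A minimal sketch of flattening depth-first search, including dead ends and
# backtracking, into one linear text trace suitable for language modeling.

def dfs_to_stream(state, goal, children, depth=0, max_depth=3):
    trace = f"explore:{state} "
    if state == goal:
        return trace + "goal! ", True
    if depth == max_depth:
        return trace + f"dead-end backtrack-from:{state} ", False
    for child in children(state):
        sub, found = dfs_to_stream(child, goal, children, depth + 1, max_depth)
        trace += sub
        if found:
            return trace, True
    return trace + f"backtrack-from:{state} ", False

# Toy usage: search over doubling/incrementing integers for a target value.
stream, _ = dfs_to_stream(1, 5, children=lambda s: [s * 2, s + 1])
print(stream)
```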
arXiv Detail & Related papers (2024-04-01T06:50:52Z)
- PathFinder: Guided Search over Multi-Step Reasoning Paths [80.56102301441899]
We propose PathFinder, a tree-search-based reasoning path generation approach.
It enhances diverse branching and multi-hop reasoning through the integration of dynamic decoding.
Our model generalizes well to longer, unseen reasoning chains, reflecting similar complexities to beam search with large branching factors.
arXiv Detail & Related papers (2023-12-08T17:05:47Z)
- Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks, and we observe similar performance improvements on code generation.
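The sketch below shows the greedy PRM-guided search loop in its generic form: sample several candidate next steps and commit to the one the step-level reward model scores highest. `propose_steps` and `prm_score` are hypothetical stand-ins for the LLM sampler and the trained reward model.

```python
# A minimal sketch of PRM-guided greedy search over reasoning steps.

def greedy_prm_search(question, propose_steps, prm_score, num_steps=3, k=4):
    path = [question]
    for _ in range(num_steps):
        candidates = propose_steps(path, k)  # k sampled next steps
        best = max(candidates, key=lambda s: prm_score(path, s))
        path.append(best)                    # commit greedily and continue
    return path

# Toy usage: string candidates scored by a fake PRM that prefers short steps.
path = greedy_prm_search(
    "Q",
    propose_steps=lambda p, k: [f"step{len(p)}-v{i}" * (i + 1) for i in range(k)],
    prm_score=lambda p, s: -len(s),
)
print(path)
```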
arXiv Detail & Related papers (2023-10-16T05:21:50Z)
- The Wisdom of Hindsight Makes Language Models Better Instruction Followers [84.9120606803906]
Reinforcement learning has seen wide success in finetuning large language models to better align with instructions via human feedback.
In this paper, we consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner.
We propose Hindsight Instruction Relabeling (HIR), a novel algorithm for aligning language models with instructions.
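The sketch below illustrates the core relabeling move in its simplest form: when the output fails the original instruction, rewrite the instruction to match what the model actually did and keep the pair as supervised data. The `satisfies` and `describe_output` functions are hypothetical stubs, not HIR's components.

```python
# A minimal sketch of hindsight instruction relabeling.

def hindsight_relabel(instruction, output, satisfies, describe_output):
    """Return a supervised (instruction, output) pair; relabel on failure."""
    if satisfies(instruction, output):
        return instruction, output
    return describe_output(output), output  # align instruction to behavior

# Toy usage: instructions ask for an exact word count.
pair = hindsight_relabel(
    "write 5 words", "only three words",
    satisfies=lambda ins, out: len(out.split()) == int(ins.split()[1]),
    describe_output=lambda out: f"write {len(out.split())} words",
)
print(pair)  # ('write 3 words', 'only three words')
```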
arXiv Detail & Related papers (2023-02-10T12:16:38Z)
- Online Control of Adaptive Large Neighborhood Search using Deep Reinforcement Learning [4.374837991804085]
We introduce a Deep Reinforcement Learning based approach called DR-ALNS that selects operators, adjusts parameters, and controls the acceptance criterion throughout the search.
We evaluate the proposed method on an orienteering problem with stochastic weights and time windows, as presented in an IJCAI competition.
The results show that our approach outperforms vanilla ALNS, ALNS tuned with Bayesian optimization, and two state-of-the-art DRL approaches.
arXiv Detail & Related papers (2022-11-01T21:33:46Z)
- Do Current Multi-Task Optimization Methods in Deep Learning Even Help? [35.27168056803643]
We show that, despite the added design and computational complexity of these algorithms, MTO methods do not yield any performance improvements beyond what is achievable via traditional optimization approaches.
We highlight alternative strategies that consistently yield improvements to the performance profile and point out common training pitfalls that might cause suboptimal results.
arXiv Detail & Related papers (2022-09-23T02:45:13Z)
- CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks [62.22920673080208]
A single-step generative model can dramatically simplify the search process and be optimized in an end-to-end manner.
We name the pre-trained generative retrieval model CorpusBrain, as all information about the corpus is encoded in its parameters without the need to construct an additional index.
arXiv Detail & Related papers (2022-08-16T10:22:49Z)
- Sample-Efficient, Exploration-Based Policy Optimisation for Routing Problems [2.6782615615913348]
This paper presents a new reinforcement learning approach that is based on entropy.
In addition, we design an off-policy-based reinforcement learning technique that maximises the expected return.
We show that our model can generalise to various routing problems.
arXiv Detail & Related papers (2022-05-31T09:51:48Z)
- Enabling arbitrary translation objectives with Adaptive Tree Search [23.40984370716434]
We introduce an adaptive tree search algorithm that can find high-scoring outputs under translation models that make no assumptions about the form or structure of the search objective.
Our algorithm has different biases than beam search, which enables a new analysis of the role of decoding bias in autoregressive models.
arXiv Detail & Related papers (2022-02-23T11:48:26Z)
- Efficient Active Search for Combinatorial Optimization Problems [1.6543719822033436]
We show that (efficient) active search enables learned models to effectively solve instances that are much larger than those seen during training.
The proposed methods offer a simple way to significantly improve the search performance of a given model and outperform state-of-the-art machine learning based methods on routing problems.
arXiv Detail & Related papers (2021-06-09T15:08:03Z)
- Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models [63.92643612630657]
This paper attempts to peek into the black-box of multilingual optimization through the lens of loss function geometry.
We find that gradient similarity measured along the optimization trajectory is an important signal, which correlates well with language proximity.
We derive a simple and scalable optimization procedure, named Gradient Vaccine, which encourages more geometrically aligned parameter updates for close tasks.
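The sketch below illustrates the core alignment step, assuming a fixed target cosine similarity (the paper adapts targets with an exponential moving average): when two tasks' gradients are less aligned than the target, mix one gradient toward the other until the target alignment is met. The mixing coefficient is one standard solution of that alignment condition.

```python
# A minimal sketch of nudging one task's gradient toward a target cosine
# similarity with another's, in the spirit of Gradient Vaccine.

import numpy as np

def align_gradient(g1, g2, target_cos=0.2, eps=1e-12):
    """Return g2 adjusted so that cos(g1, g2') >= target_cos."""
    cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2) + eps)
    if cos >= target_cos:
        return g2  # already sufficiently aligned; leave it untouched
    coef = (np.linalg.norm(g2)
            * (target_cos * np.sqrt(1 - cos**2) - cos * np.sqrt(1 - target_cos**2))
            / (np.linalg.norm(g1) * np.sqrt(1 - target_cos**2) + eps))
    return g2 + coef * g1

# Toy usage: two nearly opposing task gradients in 2D.
g1, g2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
g2_new = align_gradient(g1, g2)
print(g2_new, g2_new @ g1 / (np.linalg.norm(g2_new) * np.linalg.norm(g1)))
```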
arXiv Detail & Related papers (2020-10-12T17:26:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.