Review of Inference-Time Scaling Strategies: Reasoning, Search and RAG
- URL: http://arxiv.org/abs/2510.10787v1
- Date: Sun, 12 Oct 2025 20:09:07 GMT
- Title: Review of Inference-Time Scaling Strategies: Reasoning, Search and RAG
- Authors: Zhichao Wang, Cheng Wan, Dong Nie
- Abstract summary: Performance gains of LLMs have historically been driven by scaling up model size and training data. The rapidly diminishing availability of high-quality training data is introducing a fundamental bottleneck. This review systematically surveys the diverse techniques contributing to this new era of inference-time scaling.
- Score: 13.772025442106544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance gains of LLMs have historically been driven by scaling up model size and training data. However, the rapidly diminishing availability of high-quality training data is introducing a fundamental bottleneck, shifting the focus of research toward inference-time scaling. This paradigm uses additional computation at the time of deployment to substantially improve LLM performance on downstream tasks without costly model re-training. This review systematically surveys the diverse techniques contributing to this new era of inference-time scaling, organizing the rapidly evolving field into two comprehensive perspectives: Output-focused and Input-focused methods. Output-focused techniques encompass complex, multi-step generation strategies, including reasoning (e.g., CoT, ToT, ReAct), various search and decoding methods (e.g., MCTS, beam search), training for long CoT (e.g., RLVR, GRPO), and model ensemble methods. Input-focused techniques are primarily categorized by few-shot and RAG, with RAG as the central focus. The RAG section is further detailed through a structured examination of query expansion, data, retrieval and reranker, LLM generation methods, and multi-modal RAG.
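The abstract's two perspectives can be illustrated with a toy sketch: self-consistency voting over sampled reasoning paths as an output-focused method, and a minimal retrieve-then-augment step as an input-focused (RAG) method. This is a hypothetical illustration, not code from any surveyed paper: `sample_answer` stubs an LLM call, and the keyword-overlap `retrieve` stands in for the dense retrievers and rerankers the review actually discusses.

```python
import random
from collections import Counter

# --- Output-focused: self-consistency voting over sampled reasoning paths ---

def sample_answer(question: str, rng: random.Random) -> str:
    """Stub for one stochastic chain-of-thought sample; a real system
    would call an LLM with temperature > 0 and parse the final answer."""
    return rng.choices(["42", "41", "43"], weights=[0.6, 0.2, 0.2])[0]

def self_consistency(question: str, n_samples: int, seed: int = 0) -> str:
    """Spend extra inference-time compute by drawing n reasoning paths,
    then majority-vote their final answers."""
    rng = random.Random(seed)
    answers = [sample_answer(question, rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# --- Input-focused: a minimal RAG step (keyword-overlap retrieval) ---

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query and return the top k.
    Real systems use dense embeddings plus a reranker instead."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def augment_prompt(query: str, corpus: list[str]) -> str:
    """Prepend retrieved context to the query before generation."""
    context = "\n".join(retrieve(query, corpus, k=1))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = ["Paris is the capital of France.", "The Nile is a river in Africa."]
print(augment_prompt("What is the capital of France?", docs))
print(self_consistency("What is 6 * 7?", n_samples=25))
```

Both helpers trade extra deployment-time computation (more samples, a retrieval pass) for accuracy without touching model weights, which is the paradigm the review organizes.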
Related papers
- Multi-hop Reasoning via Early Knowledge Alignment [68.28168992785896]
Early Knowledge Alignment (EKA) aims to align Large Language Models with contextually relevant retrieved knowledge. EKA significantly improves retrieval precision, reduces cascading errors, and enhances both performance and efficiency. EKA proves effective as a versatile, training-free inference strategy that scales seamlessly to large models.
arXiv Detail & Related papers (2025-12-23T08:14:44Z)
- LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning [20.48365890565577]
We propose a novel meta-RLVR algorithm that dynamically selects training data at each step based on the average response length. We evaluate LSPO across multiple base models and datasets, demonstrating that it consistently improves learning effectiveness.
arXiv Detail & Related papers (2025-10-01T20:57:22Z)
- Domain-Aware RAG: MoL-Enhanced RL for Efficient Training and Scalable Retrieval [5.640810636056805]
MoLER is a domain-aware RAG method that uses MoL-Enhanced Reinforcement Learning to optimize retrieval. MoLER bridges the knowledge gap in RAG systems, enabling robust and scalable retrieval in specialized domains.
arXiv Detail & Related papers (2025-09-08T13:04:07Z)
- Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning [53.85659415230589]
This paper systematically reviews widely adopted reinforcement learning techniques. We present clear guidelines for selecting RL techniques tailored to specific setups. We also reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies.
arXiv Detail & Related papers (2025-08-11T17:39:45Z)
- Scaling DRL for Decision Making: A Survey on Data, Network, and Training Budget Strategies [66.83950068218033]
Scaling Laws demonstrate that scaling model parameters and training data enhances learning performance. Despite its potential to improve performance, the integration of scaling laws into deep reinforcement learning has not been fully realized. This review addresses this gap by systematically analyzing scaling strategies in three dimensions: data, network, and training budget.
arXiv Detail & Related papers (2025-08-05T08:03:12Z)
- Taming the Titans: A Survey of Efficient LLM Inference Serving [33.65474967178607]
Large Language Models (LLMs) for Generative AI have achieved remarkable progress. The substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, poses significant challenges. Recent advancements, driven by groundbreaking research, have significantly accelerated progress in this field.
arXiv Detail & Related papers (2025-04-28T12:14:02Z)
- Comprehend, Divide, and Conquer: Feature Subspace Exploration via Multi-Agent Hierarchical Reinforcement Learning [19.64843401617767]
In this paper, we introduce HRLFS, a reinforcement learning-based subspace exploration strategy for complex datasets. We show that HRLFS improves the downstream machine learning performance with iterative feature subspace exploration. We also show that HRLFS accelerates total run time by reducing the number of agents involved.
arXiv Detail & Related papers (2025-04-24T08:16:36Z)
- Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
Generative retrieval reformulates retrieval as an autoregressive generation task, where large language models generate target documents directly from a query. We systematically investigate training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence performance.
arXiv Detail & Related papers (2025-03-24T17:59:03Z)
- Chain-of-Retrieval Augmented Generation [91.02950964802454]
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Our proposed method, CoRAG, allows the model to dynamically reformulate the query based on the evolving state.
arXiv Detail & Related papers (2025-01-24T09:12:52Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- Simultaneously Evolving Deep Reinforcement Learning Models using Multifactorial Optimization [18.703421169342796]
This work proposes a framework capable of simultaneously evolving several DQL models towards solving interrelated Reinforcement Learning tasks.
A thorough experimentation is presented and discussed so as to assess the performance of the framework.
arXiv Detail & Related papers (2020-02-25T10:36:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.