Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
- URL: http://arxiv.org/abs/2601.21699v1
- Date: Thu, 29 Jan 2026 13:31:28 GMT
- Title: Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
- Authors: Hojae Han, Heeyun Jung, Jongyoon Kim, Seung-won Hwang
- Abstract summary: We show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that stabilizes early learning with minimal supervision.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While reinforcement learning (RL) has empowered multi-turn reasoning agents with retrieval and tools, existing successes largely depend on extensive on-policy rollouts in high-cost, high-accuracy regimes. Under realistic resource constraints that cannot support large models or dense exploration, however, small language model agents fall into a low-cost, low-accuracy regime, where limited rollout budgets lead to sparse exploration, sparse credit assignment, and unstable training. In this work, we challenge this trade-off and show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) improves exploration by resampling truncated near-miss trajectories. Evaluated with agents of up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. These results show that with the right inductive biases, small agents can achieve high accuracy at low training cost.
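The abstract names three inductive biases but does not spell out their formulas. A minimal sketch of component (ii), evidence-recall-based retrieval credit, assuming gold supporting passages are known per question and an illustrative mixing weight `alpha`:

```python
# Hypothetical sketch of evidence-recall-based retrieval credit.
# The paper's exact formulation is not given in the abstract; the
# function names and the mixing weight `alpha` are assumptions.

def evidence_recall(retrieved_ids: set[str], gold_ids: set[str]) -> float:
    """Fraction of gold supporting passages the agent actually retrieved."""
    if not gold_ids:
        return 0.0
    return len(retrieved_ids & gold_ids) / len(gold_ids)

def shaped_reward(answer_correct: bool, retrieved_ids: set[str],
                  gold_ids: set[str], alpha: float = 0.5) -> float:
    """Mix the sparse answer reward with a dense retrieval-credit term."""
    return float(answer_correct) + alpha * evidence_recall(retrieved_ids, gold_ids)
```

Shaping the sparse answer reward with a dense recall term keeps the policy gradient informative even when the final answer is wrong, which is what makes credit assignment workable under small rollout budgets.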
Related papers
- AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents [75.67445299298949]
AgentCPM-Explore is a compact 4B agent model with high knowledge density and strong exploration capability. We introduce a holistic training framework featuring parameter-space model fusion, reward signal denoising, and contextual information refinement. AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet and DeepSeek-v3.2 on five benchmarks.
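The abstract only names parameter-space model fusion; a common instantiation (not necessarily this paper's recipe) is linear weight interpolation between checkpoints:

```python
# Illustrative sketch of parameter-space model fusion as linear weight
# interpolation between two checkpoints; the paper's actual fusion
# method is not described in the abstract.
import torch

def fuse_state_dicts(sd_a: dict[str, torch.Tensor],
                     sd_b: dict[str, torch.Tensor],
                     weight: float = 0.5) -> dict[str, torch.Tensor]:
    # Assumes both checkpoints share the same architecture and keys.
    return {k: weight * sd_a[k] + (1.0 - weight) * sd_b[k] for k in sd_a}
```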
arXiv Detail & Related papers (2026-02-06T08:24:59Z)
- D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use [17.99381644283042]
Large reasoning models (LRMs) lack the capability to decompose sub-tasks in complex tool-use scenarios, leading to Lazy Reasoning. We propose a two-stage training framework that incentivizes LRMs' task-decomposition reasoning capability via self-distillation and diversity-aware reinforcement learning. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales.
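The abstract does not define diversity-aware reinforcement learning; one hypothetical instantiation is a group-relative diversity bonus over sampled sub-task plans, e.g. via Jaccard similarity:

```python
# Purely illustrative diversity bonus over a group of sampled sub-task
# decompositions; the paper's actual objective is not in the abstract.

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def diversity_bonus(plans: list[list[str]]) -> list[float]:
    """Reward each sampled plan for being unlike its group peers."""
    sets = [set(p) for p in plans]
    bonuses = []
    for i, s in enumerate(sets):
        peers = [jaccard(s, t) for j, t in enumerate(sets) if j != i]
        bonuses.append(1.0 - max(peers, default=0.0))
    return bonuses
```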
arXiv Detail & Related papers (2026-02-02T14:36:15Z)
- Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement [21.073482007189504]
Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks. Reinforcement learning with verifiable rewards (RLVR) is emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training.
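The defining ingredient of RLVR is a programmatic answer check in place of a learned reward model. A minimal sketch, assuming normalized exact-match as the verifier (an assumption, not this paper's setup):

```python
# Minimal verifiable-reward sketch for RLVR: reward 1 if the model's
# final answer matches the reference after normalization, else 0.
# The exact-match verifier here is an illustrative assumption.

def normalize(ans: str) -> str:
    return ans.strip().lower().rstrip(".")

def verifiable_reward(model_answer: str, reference: str) -> float:
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0
```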
arXiv Detail & Related papers (2026-01-31T16:51:50Z)
- Demystifying Reinforcement Learning in Agentic Reasoning [90.3737088727791]
We conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning. We highlight our key insights: (i) replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT; (ii) exploration-friendly techniques such as clip-higher, overlong reward shaping, and maintaining adequate policy entropy are crucial for agentic RL and can improve training efficiency.
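Among these techniques, clip-higher is the most mechanical: the upper clipping bound of the importance ratio is widened relative to the lower bound, leaving more room to up-weight low-probability tokens. A sketch with illustrative bound values:

```python
import torch

# Sketch of the asymmetric "clip higher" objective: eps_high > eps_low
# leaves more headroom to up-weight under-explored tokens. Bound values
# are illustrative, not taken from the surveyed paper. Returns the
# objective to maximize (negate it for a loss).

def clipped_objective(ratio: torch.Tensor, advantage: torch.Tensor,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    return torch.minimum(unclipped, clipped).mean()
```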
arXiv Detail & Related papers (2025-10-13T17:57:15Z)
- Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration [8.839121572048018]
We propose RAPO, an algorithm to promote broader yet focused exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset. Results show that RAPO consistently improves problem-solving performance.
arXiv Detail & Related papers (2025-10-04T16:22:19Z)
- Compass-Thinker-7B Technical Report [8.496143273813718]
We propose the Compass-Thinker-7B model to explore the potential of reinforcement learning with fewer computational resources and lower costs. Compass-Thinker-7B is trained from an open-source model through a specially designed reinforcement learning pipeline. We show that Compass-Thinker-7B possesses exceptional reasoning potential and achieves superior performance on mathematics compared to same-sized RL models.
arXiv Detail & Related papers (2025-08-12T12:58:12Z)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [111.1749164063616]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
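The abstract does not specify how internal exploitation and external data are combined; a purely illustrative sketch is mixing on-policy rollouts with external trajectories at a fixed ratio per batch:

```python
import random

# Illustrative hybrid batch construction: blend fresh on-policy rollouts
# with trajectories from an external dataset. The fixed mixing ratio and
# simple batch mixing are assumptions; RL-PLUS's actual objective (e.g.,
# importance-weighting external data) is not given in the abstract.
# Assumes both pools contain at least the requested number of samples.

def hybrid_batch(on_policy: list, external: list,
                 mix: float = 0.25, size: int = 64) -> list:
    n_ext = int(size * mix)
    return random.sample(on_policy, size - n_ext) + random.sample(external, n_ext)
```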
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty [21.96443267949563]
Agentic Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi-step reasoning and information retrieval. These systems often exhibit sub-optimal search behaviors like over-search (retrieving redundant information) and under-search (failing to retrieve necessary information). This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems.
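The abstract does not give the paper's formal definitions; one hypothetical way to quantify the two behaviors per question is:

```python
# Hypothetical quantification of the two failure modes described above:
# over-search = fraction of retrieval calls that add no new gold
# evidence; under-search = fraction of gold evidence never retrieved.
# These definitions are illustrative, not the paper's.

def search_behavior(retrieved: list[set[str]], gold: set[str]) -> dict[str, float]:
    seen: set[str] = set()
    redundant = 0
    for step_docs in retrieved:          # one retrieval call per step
        if not (step_docs & gold) - seen:
            redundant += 1               # call contributed no new evidence
        seen |= step_docs & gold
    total = len(retrieved)
    return {
        "over_search": redundant / total if total else 0.0,
        "under_search": len(gold - seen) / len(gold) if gold else 0.0,
    }
```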
arXiv Detail & Related papers (2025-05-22T20:57:56Z)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
- Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [0.0]
Our study investigates the potential of reinforcement learning to improve reasoning in small LLMs. Training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours resulted in rapid reasoning gains. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches.
arXiv Detail & Related papers (2025-03-20T15:13:23Z)
- Reward Guidance for Reinforcement Learning Tasks Based on Large Language Models: The LMGT Framework [1.5802986215292307]
Language Model Guided Reward Tuning (LMGT) is a novel, sample-efficient framework for reinforcement learning. We show that LMGT adeptly balances exploration and exploitation, thereby guiding the agent's exploratory behavior. Our results suggest that LMGT can substantially reduce the computational resources required during the RL training phase.
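The abstract suggests the language model's judgment steers the reward signal; a minimal sketch, with the additive form, the weight `beta`, and `score_fn` as illustrative assumptions:

```python
# Sketch of LLM-guided reward shaping in the spirit of LMGT: an LLM
# score over the transition is added to the environment reward.
# `score_fn` is a stand-in for any LLM judge; the additive form and
# the weight `beta` are assumptions, not the paper's exact scheme.
from typing import Callable

def shaped_reward(env_reward: float, state: str, action: str,
                  score_fn: Callable[[str, str], float],
                  beta: float = 0.1) -> float:
    return env_reward + beta * score_fn(state, action)
```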
arXiv Detail & Related papers (2024-09-07T07:40:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.