Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
- URL: http://arxiv.org/abs/2601.21699v1
- Date: Thu, 29 Jan 2026 13:31:28 GMT
- Title: Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
- Authors: Hojae Han, Heeyun Jung, Jongyoon Kim, Seung-won Hwang
- Abstract summary: We show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that stabilizes early learning with minimal supervision.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While reinforcement learning (RL) has empowered multi-turn reasoning agents with retrieval and tools, existing successes largely depend on extensive on-policy rollouts in high-cost, high-accuracy regimes. Under realistic resource constraints that cannot support large models or dense exploration, however, small language model agents fall into a low-cost, low-accuracy regime, where limited rollout budgets lead to sparse exploration, sparse credit assignment, and unstable training. In this work, we challenge this trade-off and show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) improves exploration by resampling truncated near-miss trajectories. Evaluated with agents of up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. These results show that with the right inductive biases, small agents can achieve high accuracy at low training cost.
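The abstract names three inductive biases but does not spell out their formulas. A minimal sketch of component (ii), evidence-recall-based retrieval credit, assuming gold supporting passages are known per question and an illustrative mixing weight `alpha`:

```python
# Hypothetical sketch of evidence-recall-based retrieval credit.
# The paper's exact formulation is not given in the abstract; the
# function names and the mixing weight `alpha` are assumptions.

def evidence_recall(retrieved_ids: set[str], gold_ids: set[str]) -> float:
    """Fraction of gold supporting passages the agent actually retrieved."""
    if not gold_ids:
        return 0.0
    return len(retrieved_ids & gold_ids) / len(gold_ids)

def shaped_reward(answer_correct: bool, retrieved_ids: set[str],
                  gold_ids: set[str], alpha: float = 0.5) -> float:
    """Mix the sparse answer reward with a dense retrieval-credit term."""
    return float(answer_correct) + alpha * evidence_recall(retrieved_ids, gold_ids)
```

Shaping the sparse answer reward with a dense recall term keeps the policy gradient informative even when the final answer is wrong, which is what makes credit assignment workable under small rollout budgets.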
Related papers
- AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents [75.67445299298949]
AgentCPM-Explore is a compact 4B agent model with high knowledge density and strong exploration capability. We introduce a holistic training framework featuring parameter-space model fusion, reward signal denoising, and contextual information refinement. AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet and DeepSeek-v3.2 on five benchmarks.
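The abstract only names parameter-space model fusion; a common instantiation (not necessarily this paper's recipe) is linear weight interpolation between checkpoints:

```python
# Illustrative sketch of parameter-space model fusion as linear weight
# interpolation between two checkpoints; the paper's actual fusion
# method is not described in the abstract.
import torch

def fuse_state_dicts(sd_a: dict[str, torch.Tensor],
                     sd_b: dict[str, torch.Tensor],
                     weight: float = 0.5) -> dict[str, torch.Tensor]:
    # Assumes both checkpoints share the same architecture and keys.
    return {k: weight * sd_a[k] + (1.0 - weight) * sd_b[k] for k in sd_a}
```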
arXiv Detail & Related papers (2026-02-06T08:24:59Z)
- D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use [17.99381644283042]
Large reasoning models (LRMs) lack the capability to decompose sub-tasks in complex tool-use scenarios, leading to Lazy Reasoning. We propose a two-stage training framework that incentivizes LRMs' task-decomposition reasoning capability via self-distillation and diversity-aware reinforcement learning. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales.
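The abstract does not define diversity-aware reinforcement learning; one hypothetical instantiation is a group-relative diversity bonus over sampled sub-task plans, e.g. via Jaccard similarity:

```python
# Purely illustrative diversity bonus over a group of sampled sub-task
# decompositions; the paper's actual objective is not in the abstract.

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def diversity_bonus(plans: list[list[str]]) -> list[float]:
    """Reward each sampled plan for being unlike its group peers."""
    sets = [set(p) for p in plans]
    bonuses = []
    for i, s in enumerate(sets):
        peers = [jaccard(s, t) for j, t in enumerate(sets) if j != i]
        bonuses.append(1.0 - max(peers, default=0.0))
    return bonuses
```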
arXiv Detail & Related papers (2026-02-02T14:36:15Z)
- Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement [21.073482007189504]
Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks. Reinforcement learning with verifiable rewards (RLVR) is emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training.
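The defining ingredient of RLVR is a programmatic answer check in place of a learned reward model. A minimal sketch, assuming normalized exact-match as the verifier (an assumption, not this paper's setup):

```python
# Minimal verifiable-reward sketch for RLVR: reward 1 if the model's
# final answer matches the reference after normalization, else 0.
# The exact-match verifier here is an illustrative assumption.

def normalize(ans: str) -> str:
    return ans.strip().lower().rstrip(".")

def verifiable_reward(model_answer: str, reference: str) -> float:
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0
```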
arXiv Detail & Related papers (2026-01-31T16:51:50Z)
- Demystifying Reinforcement Learning in Agentic Reasoning [90.3737088727791]
We conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning. We highlight our key insights: (i) replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT; (ii) exploration-friendly techniques such as clip-higher, overlong reward shaping, and maintaining adequate policy entropy are crucial for agentic RL and can improve training efficiency.
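Among these techniques, clip-higher is the most mechanical: the upper clipping bound of the importance ratio is widened relative to the lower bound, leaving more room to up-weight low-probability tokens. A sketch with illustrative bound values:

```python
import torch

# Sketch of the asymmetric "clip higher" objective: eps_high > eps_low
# leaves more headroom to up-weight under-explored tokens. Bound values
# are illustrative, not taken from the surveyed paper. Returns the
# objective to maximize (negate it for a loss).

def clipped_objective(ratio: torch.Tensor, advantage: torch.Tensor,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    return torch.minimum(unclipped, clipped).mean()
```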
arXiv Detail & Related papers (2025-10-13T17:57:15Z)
- Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration [8.839121572048018]
We propose RAPO, an algorithm to promote broader yet focused exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset. Results show that RAPO consistently improves problem-solving performance.
arXiv Detail & Related papers (2025-10-04T16:22:19Z)
- Compass-Thinker-7B Technical Report [8.496143273813718]
We propose the Compass-Thinker-7B model to explore the potential of reinforcement learning with fewer computational resources and lower costs. Compass-Thinker-7B is trained from an open-source model through a specially designed reinforcement learning pipeline. We show that Compass-Thinker-7B possesses exceptional reasoning potential and achieves superior performance on mathematics compared to same-sized RL models.
arXiv Detail & Related papers (2025-08-12T12:58:12Z)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [111.1749164063616]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
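The abstract does not specify how internal exploitation and external data are combined; a purely illustrative sketch is mixing on-policy rollouts with external trajectories at a fixed ratio per batch:

```python
import random

# Illustrative hybrid batch construction: blend fresh on-policy rollouts
# with trajectories from an external dataset. The fixed mixing ratio and
# simple batch mixing are assumptions; RL-PLUS's actual objective (e.g.,
# importance-weighting external data) is not given in the abstract.
# Assumes both pools contain at least the requested number of samples.

def hybrid_batch(on_policy: list, external: list,
                 mix: float = 0.25, size: int = 64) -> list:
    n_ext = int(size * mix)
    return random.sample(on_policy, size - n_ext) + random.sample(external, n_ext)
```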
arXiv Detail & Related papers (2025-07-31T23:55:29Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty [21.96443267949563]
Agentic Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi-step reasoning and information retrieval. These systems often exhibit sub-optimal search behaviors like over-search (retrieving redundant information) and under-search (failing to retrieve necessary information). This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems.
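The abstract does not give the paper's formal definitions; one hypothetical way to quantify the two behaviors per question is:

```python
# Hypothetical quantification of the two failure modes described above:
# over-search = fraction of retrieval calls that add no new gold
# evidence; under-search = fraction of gold evidence never retrieved.
# These definitions are illustrative, not the paper's.

def search_behavior(retrieved: list[set[str]], gold: set[str]) -> dict[str, float]:
    seen: set[str] = set()
    redundant = 0
    for step_docs in retrieved:          # one retrieval call per step
        if not (step_docs & gold) - seen:
            redundant += 1               # call contributed no new evidence
        seen |= step_docs & gold
    total = len(retrieved)
    return {
        "over_search": redundant / total if total else 0.0,
        "under_search": len(gold - seen) / len(gold) if gold else 0.0,
    }
```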
arXiv Detail & Related papers (2025-05-22T20:57:56Z)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
- Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [0.0]
Our study investigates the potential of reinforcement learning to improve reasoning in small LLMs. Training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours resulted in rapid reasoning gains. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches.
arXiv Detail & Related papers (2025-03-20T15:13:23Z)
- Reward Guidance for Reinforcement Learning Tasks Based on Large Language Models: The LMGT Framework [1.5802986215292307]
Language Model Guided Reward Tuning (LMGT) is a novel, sample-efficient framework for reinforcement learning. We show that LMGT adeptly balances exploration and exploitation, thereby guiding the agent's exploratory behavior. Our results suggest that LMGT can substantially reduce the computational resources required during the RL training phase.
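The abstract suggests the language model's judgment steers the reward signal; a minimal sketch, with the additive form, the weight `beta`, and `score_fn` as illustrative assumptions:

```python
# Sketch of LLM-guided reward shaping in the spirit of LMGT: an LLM
# score over the transition is added to the environment reward.
# `score_fn` is a stand-in for any LLM judge; the additive form and
# the weight `beta` are assumptions, not the paper's exact scheme.
from typing import Callable

def shaped_reward(env_reward: float, state: str, action: str,
                  score_fn: Callable[[str, str], float],
                  beta: float = 0.1) -> float:
    return env_reward + beta * score_fn(state, action)
```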
arXiv Detail & Related papers (2024-09-07T07:40:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.