Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning
- URL: http://arxiv.org/abs/2601.20209v1
- Date: Wed, 28 Jan 2026 03:15:34 GMT
- Title: Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning
- Authors: Jinyang Wu, Shuo Yang, Changpeng Yang, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao,
- Abstract summary: We propose Spark (Strategic Policy-Aware exploRation via Key-state dynamic branching). Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories. Spark achieves superior success rates with significantly fewer training samples, exhibiting robust generalization even in unseen scenarios.
- Score: 31.17280303212164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning has empowered large language models to act as intelligent agents, yet training them for long-horizon tasks remains challenging due to the scarcity of high-quality trajectories, especially under limited resources. Existing methods typically scale up rollout sizes and indiscriminately allocate computational resources among intermediate steps. Such attempts inherently waste substantial computation budget on trivial steps while failing to guarantee sample quality. To address this, we propose Spark (Strategic Policy-Aware exploRation via Key-state dynamic branching), a novel framework that selectively branches at critical decision states for resource-efficient exploration. Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories, thereby achieving precise resource allocation that prioritizes sampling quality over blind coverage. This design leverages the agent's intrinsic decision-making signals to reduce dependence on human priors, enabling the agent to autonomously expand exploration and achieve stronger generalization. Experiments across diverse tasks (e.g., embodied planning) demonstrate that Spark achieves superior success rates with significantly fewer training samples, exhibiting robust generalization even in unseen scenarios.
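The abstract does not specify which intrinsic signal marks a "critical decision state", so the following is a minimal sketch under one plausible assumption: per-step policy entropy serves as the signal, and a fixed branching budget is spent only on the highest-entropy steps. The function names and the threshold-by-budget scheme are illustrative, not Spark's actual implementation.

```python
import math

def step_entropy(probs):
    """Shannon entropy of the policy's action distribution at one step."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_branch_points(trajectory_probs, budget):
    """Rank steps by entropy and branch only at the top-`budget` ones.

    Low-entropy (near-deterministic) steps are treated as trivial and
    receive no extra rollouts, concentrating compute on real decisions.
    """
    entropies = [step_entropy(p) for p in trajectory_probs]
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:budget])

# Example: step 1 is near-deterministic, steps 0 and 2 are genuine choices.
traj = [
    [0.5, 0.5],        # high entropy: a real decision point
    [0.99, 0.01],      # low entropy: trivial step, skip branching
    [0.4, 0.3, 0.3],   # high entropy: a real decision point
]
print(select_branch_points(traj, budget=2))  # [0, 2]
```

Under this reading, "precise resource allocation" reduces to spending the rollout budget where the policy is most uncertain rather than uniformly across all steps.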
Related papers
- Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling [29.182538022605627]
Branching Relative Policy Optimization (BranPO) is a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions.
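The truncate-and-resample step above can be sketched as follows. This is a hypothetical illustration: `policy` is a stand-in callable, and the tail-window truncation rule is an assumption, since the blurb only says branching happens "near the tail". The result is two suffixes sharing one prefix, which is what step-level contrastive supervision compares.

```python
import random

def branch_near_tail(trajectory, policy, tail_window=3, seed=0):
    """Truncate a trajectory near its tail and resample an alternative
    continuation, returning (shared_prefix, original_suffix, resampled_suffix).
    """
    rng = random.Random(seed)
    # Pick a truncation point within the last `tail_window` steps.
    cut = max(1, len(trajectory) - rng.randint(1, tail_window))
    prefix = list(trajectory[:cut])
    # Roll out an alternative continuation from the shared prefix.
    alt = list(prefix)
    while len(alt) < len(trajectory):
        alt.append(policy(alt))
    return prefix, list(trajectory[cut:]), alt[cut:]

# Usage: a toy policy that always emits the same action.
prefix, orig_suffix, alt_suffix = branch_near_tail(
    ["a", "b", "c", "d", "e"], policy=lambda state: "x")
```

Comparing `orig_suffix` against `alt_suffix` over the shared `prefix` yields a contrastive pair without needing a value function or dense rewards.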
arXiv Detail & Related papers (2026-02-03T16:43:09Z) - Verified Critical Step Optimization for LLM Agents [67.05296684575445]
Critical Step Optimization (CSO) focuses preference learning on verified critical steps. The method starts from failed policy trajectories rather than expert demonstrations. Experiments on GAIA-Text-103 and XBench-DeepSearch show that CSO achieves 37% and 26% relative improvement over the SFT baseline.
arXiv Detail & Related papers (2026-02-03T11:41:02Z) - To Retrieve or To Think? An Agentic Approach for Context Evolution [18.882502245155624]
Agentic Context Evolution (ACE) is a framework inspired by human metacognition that determines whether to seek new evidence or reason with existing knowledge. ACE employs a central orchestrator agent to make decisions strategically via majority voting. Our work provides valuable insights into advancing context-evolved generation for complex, knowledge-intensive tasks.
arXiv Detail & Related papers (2026-01-13T17:25:57Z) - Sample-Efficient Neurosymbolic Deep Reinforcement Learning [49.60927398960061]
We propose a neuro-symbolic Deep RL approach that integrates background symbolic knowledge to improve sample efficiency. Online reasoning is performed to guide the training process through two mechanisms. We show improved performance over a state-of-the-art reward machine baseline.
arXiv Detail & Related papers (2026-01-06T09:28:53Z) - WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning [73.91893534088798]
WebSailor is a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation. WebSailor significantly outperforms all open-source agents in complex information-seeking tasks.
arXiv Detail & Related papers (2025-09-16T17:57:03Z) - Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning [56.496001894673235]
Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs). Our analysis reveals that puzzling phenomena like "aha moments", "length-scaling", and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy.
arXiv Detail & Related papers (2025-09-03T18:52:49Z) - DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning [32.80410217696872]
We propose a method for directed sparse-reward goal-conditioned very long-horizon RL (DISCOVER). We connect DISCOVER to principled exploration in bandits, formally bounding the time until the target task becomes achievable in terms of the agent's initial distance to the target. We find that the directed goal selection of DISCOVER solves exploration problems that are beyond the reach of prior state-of-the-art exploration methods in RL.
arXiv Detail & Related papers (2025-05-26T11:35:07Z) - A Multi-Agent Reinforcement Learning Approach for Cooperative Air-Ground-Human Crowdsensing in Emergency Rescue [22.201769922727077]
This paper tackles the Heterogeneous Collaborative-Sensing Task Allocation problem for emergency rescue, considering humans, UAVs, and UGVs. We introduce a novel "Hard-Cooperative" policy where UGVs prioritize recharging low-battery UAVs, alongside performing their sensing tasks. We propose HECTA4ER, a novel multi-agent reinforcement learning algorithm built upon a Decentralized Execution architecture.
arXiv Detail & Related papers (2025-05-11T14:49:15Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - TDMPBC: Self-Imitative Reinforcement Learning for Humanoid Robot Control [26.93901849666341]
In general, feasible regions for accomplishing tasks within complex high-dimensional spaces are exceedingly narrow. We propose Self-Imitative Reinforcement learning, where the RL algorithm also imitates potentially task-relevant trajectories. Our proposed algorithm achieves a 120% performance improvement on the challenging HumanoidBench with 5% extra computation overhead.
arXiv Detail & Related papers (2025-02-24T16:55:27Z) - Action abstractions for amortized sampling [49.384037138511246]
We propose an approach to incorporate the discovery of action abstractions, or high-level actions, into the policy optimization process.
Our approach involves iteratively extracting action subsequences commonly used across many high-reward trajectories and "chunking" them into a single action that is added to the action space.
arXiv Detail & Related papers (2024-10-19T19:22:50Z) - Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning [83.41487567765871]
Skipper is a model-based reinforcement learning framework.
It automatically decomposes the given task into smaller, more manageable subtasks.
It enables sparse decision-making and focused abstractions on the relevant parts of the environment.
arXiv Detail & Related papers (2023-09-30T02:25:18Z) - Semantically Aligned Task Decomposition in Multi-Agent Reinforcement Learning [56.26889258704261]
We propose a novel "disentangled" decision-making method, Semantically Aligned task decomposition in MARL (SAMA).
SAMA prompts pretrained language models with chain-of-thought that can suggest potential goals, provide suitable goal decomposition and subgoal allocation as well as self-reflection-based replanning.
SAMA demonstrates considerable advantages in sample efficiency compared to state-of-the-art ASG methods.
arXiv Detail & Related papers (2023-05-18T10:37:54Z) - Hierarchical Reinforcement Learning By Discovering Intrinsic Options [18.041140234312934]
HIDIO can learn task-agnostic options in a self-supervised manner while jointly learning to utilize them to solve sparse-reward tasks.
In experiments on sparse-reward robotic manipulation and navigation tasks, HIDIO achieves higher success rates with greater sample efficiency.
arXiv Detail & Related papers (2021-01-16T20:54:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.