Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning
- URL: http://arxiv.org/abs/2512.09706v1
- Date: Wed, 10 Dec 2025 14:52:29 GMT
- Title: Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning
- Authors: Kaichen He, Zihao Wang, Muyao Li, Anji Liu, Yitao Liang
- Abstract summary: CrossAgent is a unified agentic model that masters heterogeneous action spaces and autonomously selects the most effective interface for each step of a trajectory. Experiments on over 800 tasks in the open-world Minecraft environment demonstrate that CrossAgent achieves state-of-the-art performance.
- Score: 42.1534425503333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The paradigm of agentic AI is shifting from engineered complex workflows to post-training native models. However, existing agents are typically confined to static, predefined action spaces--such as exclusively using APIs, GUI events, or robotic commands. This rigidity limits their adaptability in dynamic environments where the optimal granularity of interaction varies contextually. To bridge this gap, we propose CrossAgent, a unified agentic model that masters heterogeneous action spaces and autonomously selects the most effective interface for each step of a trajectory. We introduce a comprehensive training pipeline that integrates cold-start supervised fine-tuning with a Multi-Turn Group Relative Policy Optimization (GRPO) algorithm. This approach enables the agent to learn adaptive action switching--balancing high-level efficiency with low-level precision--without human-specified rules. Extensive experiments on over 800 tasks in the open-world Minecraft environment demonstrate that CrossAgent achieves state-of-the-art performance. By dynamically leveraging the strengths of diverse action spaces, our model significantly outperforms fixed-action baselines, exhibiting superior generalization and efficiency in long-horizon reasoning. All code and models are available at https://github.com/CraftJarvis/OpenHA
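The abstract names Multi-Turn Group Relative Policy Optimization (GRPO) as the RL stage after cold-start SFT but does not spell out the algorithm. The defining step of GRPO is group-relative credit assignment: rewards from a group of trajectories sampled for the same task are normalized by the group's own mean and standard deviation, removing the need for a learned value baseline. A minimal sketch of that step (the function name and toy rewards are illustrative, not taken from the paper or its codebase):

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantages as used in GRPO-style training:
    each trajectory's reward is centered on its sampling group's mean
    and scaled by the group's (population) standard deviation."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    if std == 0.0:
        # All rollouts scored identically: no learning signal for this group.
        return [0.0 for _ in group_rewards]
    return [(r - mean) / std for r in group_rewards]

# Example: two rollouts of the same task, one failed (0.0), one succeeded (2.0).
print(grpo_advantages([0.0, 2.0]))  # the successful rollout gets a positive advantage
```

These advantages then weight the policy-gradient update for every token (or, in the multi-turn setting, every action step) of the corresponding trajectory; how CrossAgent extends this across turns is described in the paper itself.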
Related papers
- ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas [13.919124676472022]
ASTRA is an end-to-end framework for training tool-augmented language model agents. ASTRA integrates scalable data synthesis and verifiable reinforcement learning. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance.
arXiv Detail & Related papers (2026-01-29T11:22:23Z)
- ARM: Role-Conditioned Neuron Transplantation for Training-Free Generalist LLM Agent Merging [51.409102048965394]
Agent-Role Merging (ARM) is an activation-guided, role-conditioned neuron transplantation method for model merging in LLM agents. ARM extends existing merging methods from static natural language tasks to multi-turn agent scenarios.
arXiv Detail & Related papers (2026-01-12T08:31:53Z)
- Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback [51.22403664895878]
Agent2World is a tool-augmented multi-agent framework that achieves strong inference-time world-model generation. It also serves as a data engine for supervised fine-tuning, by grounding generation in multi-agent feedback.
arXiv Detail & Related papers (2025-12-26T18:54:14Z)
- Diffusion Forcing for Multi-Agent Interaction Sequence Modeling [52.769202433667125]
MAGNet is a unified autoregressive diffusion framework for multi-agent motion generation. It supports a wide range of interaction tasks through flexible conditioning and sampling. It captures both tightly synchronized activities and loosely structured social interactions.
arXiv Detail & Related papers (2025-12-19T18:59:02Z)
- Multi-Agent Reinforcement Learning and Real-Time Decision-Making in Robotic Soccer for Virtual Environments [0.0]
This paper presents a unified Multi-Agent Reinforcement Learning (MARL) framework that addresses these challenges. To ensure scalability, we integrate mean-field theory into the HRL framework. Our mean-field actor-critic method achieves a significant performance boost.
arXiv Detail & Related papers (2025-12-02T19:11:44Z)
- Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control [72.43808515668947]
We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control. Hi-Agent features a high-level reasoning model and a low-level action model that are jointly optimized. Hi-Agent achieves a new state-of-the-art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark.
arXiv Detail & Related papers (2025-10-16T07:38:21Z)
- One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning [32.13266149565313]
Multi-task world models like UniZero excel in single-task settings. We find that gradient conflicts and the loss of model plasticity often constrain their sample efficiency. In this work, we address these challenges from two complementary perspectives: the single learning iteration and the overall learning process.
arXiv Detail & Related papers (2025-09-09T17:27:53Z)
- Towards Agentic AI for Multimodal-Guided Video Object Segmentation [14.877182670778284]
Referring-based Video Object Segmentation is a multimodal problem that requires producing fine-grained segmentation results guided by external cues. Recent advances in vision-language foundation models open a promising direction toward training-free approaches. We propose Multi-Modal Agent, a novel agentic system designed to solve this task in a more flexible and adaptive manner.
arXiv Detail & Related papers (2025-08-14T12:11:15Z)
- Multiple Weaks Win Single Strong: Large Language Models Ensemble Weak Reinforcement Learning Agents into a Supreme One [28.264011412168347]
Model ensemble is a useful approach in reinforcement learning (RL) for training effective agents. We propose LLM-Ens, a novel approach that enhances RL model ensembles with task-specific semantic understanding.
arXiv Detail & Related papers (2025-05-21T09:35:43Z)
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [73.75271615101754]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
- Agent models: Internalizing Chain-of-Action Generation into Reasoning models [15.954047804223379]
We position Large Agent Models (LAMs) that internalize the generation of Chain-of-Action (CoA). Our proposed AutoCoA framework combines supervised fine-tuning (SFT) and reinforcement learning (RL). Main components include step-level action triggering, trajectory-level CoA, and an internal world model to reduce real-environment interaction costs.
arXiv Detail & Related papers (2025-03-09T12:19:47Z)
- Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.