ManiAgent: An Agentic Framework for General Robotic Manipulation
- URL: http://arxiv.org/abs/2510.11660v2
- Date: Tue, 14 Oct 2025 03:03:05 GMT
- Title: ManiAgent: An Agentic Framework for General Robotic Manipulation
- Authors: Yi Yang, Kefan Gu, Yuqing Wen, Hebei Li, Yucheng Zhao, Tiancai Wang, Xudong Liu
- Abstract summary: We introduce ManiAgent, an agentic architecture for general manipulation tasks. Multiple agents communicate with one another to perform environmental perception, sub-task decomposition, and action generation. ManiAgent achieves an 86.8% success rate on the SimplerEnv benchmark and 95.8% on real-world pick-and-place tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Vision-Language-Action (VLA) models have demonstrated impressive capabilities in robotic manipulation, their performance in complex reasoning and long-horizon task planning is limited by data scarcity and model capacity. To address this, we introduce ManiAgent, an agentic architecture for general manipulation tasks that achieves end-to-end output from task descriptions and environmental inputs to robotic manipulation actions. In this framework, multiple agents communicate with one another to perform environmental perception, sub-task decomposition, and action generation, enabling efficient handling of complex manipulation scenarios. Evaluations show ManiAgent achieves an 86.8% success rate on the SimplerEnv benchmark and 95.8% on real-world pick-and-place tasks, enabling efficient data collection that yields VLA models with performance comparable to those trained on human-annotated datasets. The project webpage is available at https://yi-yang929.github.io/ManiAgent/.
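The abstract describes a perception / sub-task-decomposition / action split across communicating agents. The following is a minimal sketch of how such a pipeline could be wired together; the agent roles follow the abstract, but every name and the `call_llm` helper are illustrative assumptions, not ManiAgent's actual interfaces.

```python
# Minimal sketch of a perception -> decomposition -> action pipeline.
# Agent roles follow the abstract; all names and the call_llm() stub
# are hypothetical, not ManiAgent's actual API.
from dataclasses import dataclass

@dataclass
class Observation:
    image_description: str   # e.g. output of a VLM scene captioner
    task: str                # natural-language task description

def call_llm(role_prompt: str, content: str) -> str:
    """Placeholder for a chat-model call (wire in any LLM/VLM endpoint)."""
    raise NotImplementedError

def perception_agent(obs: Observation) -> str:
    return call_llm("Describe the objects and spatial relations.", obs.image_description)

def planner_agent(scene: str, task: str) -> list[str]:
    plan = call_llm("Decompose the task into ordered sub-tasks, one per line.",
                    f"{scene}\n{task}")
    return [step for step in plan.splitlines() if step.strip()]

def action_agent(sub_task: str, scene: str) -> str:
    # A real system would emit gripper poses or motion primitives here.
    return call_llm("Emit a low-level manipulation action.", f"{scene}\n{sub_task}")

def run(obs: Observation) -> list[str]:
    scene = perception_agent(obs)
    return [action_agent(step, scene) for step in planner_agent(scene, obs.task)]
```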
Related papers
- Demonstration-Free Robotic Control via LLM Agents [0.0]
We introduce FAEA (Frontier Agent as Embodied Agent), which applies an LLM agent framework directly to embodied manipulation without modification. With privileged environment state access, FAEA achieves success rates of 84.9%, 85.7%, and 96%, respectively. Our results indicate that general-purpose agents are sufficient for a class of manipulation tasks dominated by deliberative, task-level planning.
arXiv Detail & Related papers (2026-01-28T07:49:35Z) - AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation [24.199522837278128]
We present a new task-agnostic action paradigm that decouples action execution from task-specific conditioning. ATARA is a scalable self-supervised framework that accelerates data collection by over 30× compared to human teleoperation. We propose AnyPos, an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder.
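AnyPos is described as an inverse dynamics model, i.e. one that predicts the action linking two consecutive observations. Below is a toy PyTorch version of that general idea; the architecture is an assumption for illustration and does not reproduce AnyPos's Arm-Decoupled Estimation or Direction-Aware Decoder.

```python
# Toy inverse dynamics model: regress the action that carried the arm
# from one observation to the next. Layer sizes are arbitrary.
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, action_dim))

    def forward(self, obs_t: torch.Tensor, obs_next: torch.Tensor) -> torch.Tensor:
        # Concatenate consecutive observations and predict the action.
        return self.net(torch.cat([obs_t, obs_next], dim=-1))

model = InverseDynamics(obs_dim=32, action_dim=7)
action = model(torch.randn(1, 32), torch.randn(1, 32))  # predicted 7-DoF action
```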
arXiv Detail & Related papers (2025-07-17T03:48:57Z) - AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents [60.881609323604685]
AgentSynth is a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets. Our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotation.
arXiv Detail & Related papers (2025-06-17T05:46:52Z) - OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis [70.39500621448383]
Open-world mobile manipulation remains a challenge due to the need to generalize to open-ended instructions and environments. We propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot by function calling. We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators with global scene understanding, robot state tracking, and multi-modal action generation in a unified model.
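The abstract says the agent controls the robot by function calling. A minimal sketch of such a loop follows; the tool registry, skill signatures, and JSON call format are assumptions for illustration, not OWMM-Agent's actual interface.

```python
# Hypothetical function-calling dispatch for a mobile manipulator agent.
# The model emits a JSON tool call; the runtime maps it to a robot skill.
import json
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {
    "navigate_to": lambda x, y: f"moved to ({x}, {y})",
    "pick": lambda obj: f"picked {obj}",
    "place": lambda obj, surface: f"placed {obj} on {surface}",
}

def agent_step(model_output: str) -> str:
    """Parse one model turn like {"tool": "pick", "args": {"obj": "cup"}}
    and dispatch it to the matching skill."""
    call = json.loads(model_output)
    return TOOLS[call["tool"]](**call["args"])

print(agent_step('{"tool": "pick", "args": {"obj": "cup"}}'))  # -> picked cup
```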
arXiv Detail & Related papers (2025-06-04T17:57:44Z) - Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models [49.4824734958566]
Chain-of-Modality (CoM) enables Vision Language Models to reason about multimodal human demonstration data. CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt.
arXiv Detail & Related papers (2025-04-17T21:31:23Z) - Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations [77.31328397965653]
We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework that addresses these challenges through two key innovations.
First, a novel agent-agnostic visual representation derived from human manipulation videos, with embodiment specifics obscured to enhance generalizability.
Second, an agent-agnostic action representation that abstracts a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object.
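A hedged sketch of what such an agent-agnostic, end-effector-centric action could look like; the field names and the `ik_solver` hook are hypothetical, not Ag2Manip's data structures.

```python
# Sketch of an agent-agnostic action: the robot's kinematics are hidden
# behind an end-effector pose plus gripper state, which any embodiment
# can map to joint commands via its own inverse kinematics.
# Field names are illustrative, not Ag2Manip's actual representation.
from dataclasses import dataclass

@dataclass
class ProxyAction:
    position: tuple[float, float, float]             # end-effector xyz, meters
    orientation: tuple[float, float, float, float]   # quaternion (w, x, y, z)
    gripper_open: float                              # 0.0 = closed, 1.0 = open

def to_joint_targets(action: ProxyAction, ik_solver) -> list[float]:
    """Embodiment-specific step: a hypothetical IK solver turns the
    universal proxy action into joint angles for one particular robot."""
    return ik_solver.solve(action.position, action.orientation)
```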
arXiv Detail & Related papers (2024-04-26T16:40:17Z) - MobileAgent: Enhancing Mobile Control via Human-Machine Interaction and SOP Integration [0.0]
Large Language Models (LLMs) are now capable of automating mobile device operations for users.
Privacy concerns related to personalized user data arise during mobile operations, requiring user confirmation.
We have designed interactive tasks between agents and humans to identify sensitive information and align with personalized user needs.
Our approach is evaluated on the new device control benchmark AitW, which encompasses 30K unique instructions across multi-step tasks.
arXiv Detail & Related papers (2024-01-04T03:44:42Z) - Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking [54.776890150458385]
We develop an efficient system for training universal agents capable of multi-task manipulation skills.
We are able to train a single agent capable of 12 unique skills, and demonstrate its generalization over 38 tasks.
On average, RoboAgent outperforms prior methods by over 40% in unseen situations.
arXiv Detail & Related papers (2023-09-05T03:14:39Z) - CLAS: Coordinating Multi-Robot Manipulation with Central Latent Action Spaces [9.578169216444813]
This paper proposes an approach to coordinating multi-robot manipulation through learned latent action spaces that are shared across different agents.
We validate our method in simulated multi-robot manipulation tasks and demonstrate improvement over previous baselines in terms of sample efficiency and learning performance.
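A toy PyTorch illustration of the central-latent-action idea: one encoder maps the joint observation to a shared latent action, and per-robot decoders map that latent to individual actions, so coordination happens in the shared space. Sizes and structure are arbitrary assumptions, not the CLAS implementation.

```python
# Toy central latent action space: a single latent z is decoded into
# per-robot actions. Not the CLAS training code.
import torch
import torch.nn as nn

class CentralLatentPolicy(nn.Module):
    def __init__(self, obs_dim: int, latent_dim: int, action_dim: int, n_robots: int):
        super().__init__()
        # One encoder maps the joint observation to a shared latent action.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        # Each robot owns a decoder from the shared latent to its own action.
        self.decoders = nn.ModuleList(
            nn.Linear(latent_dim, action_dim) for _ in range(n_robots))

    def forward(self, joint_obs: torch.Tensor) -> list[torch.Tensor]:
        z = self.encoder(joint_obs)                 # shared latent action
        return [dec(z) for dec in self.decoders]    # per-robot actions

policy = CentralLatentPolicy(obs_dim=32, latent_dim=8, action_dim=7, n_robots=2)
actions = policy(torch.randn(1, 32))                # two 7-DoF action tensors
```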
arXiv Detail & Related papers (2022-11-28T23:20:47Z)