M4Diffuser: Multi-View Diffusion Policy with Manipulability-Aware Control for Robust Mobile Manipulation
- URL: http://arxiv.org/abs/2509.14980v1
- Date: Thu, 18 Sep 2025 14:09:53 GMT
- Title: M4Diffuser: Multi-View Diffusion Policy with Manipulability-Aware Control for Robust Mobile Manipulation
- Authors: Ju Dong, Lei Zhang, Liding Zhang, Yao Ling, Yu Fu, Kaixin Bai, Zoltán-Csaba Márton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang
- Abstract summary: M4Diffuser is a hybrid framework that integrates a Multi-View Diffusion Policy with a novel Reduced and Manipulability-aware QP controller for mobile manipulation. Our approach demonstrates robust performance with smooth whole-body coordination and strong generalization to unseen tasks.
- Score: 17.9979990426915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mobile manipulation requires the coordinated control of a mobile base and a robotic arm while simultaneously perceiving both global scene context and fine-grained object details. Existing single-view approaches often fail in unstructured environments due to limited fields of view and limited exploration and generalization abilities. Moreover, classical controllers, although stable, struggle with efficiency and manipulability near singularities. To address these challenges, we propose M4Diffuser, a hybrid framework that integrates a Multi-View Diffusion Policy with a novel Reduced and Manipulability-aware QP (ReM-QP) controller for mobile manipulation. The diffusion policy leverages proprioceptive states and complementary camera perspectives, combining close-range object details with global scene context, to generate task-relevant end-effector goals in the world frame. These high-level goals are then executed by the ReM-QP controller, which eliminates slack variables for computational efficiency and incorporates manipulability-aware preferences for robustness near singularities. Comprehensive experiments in simulation and real-world environments show that M4Diffuser achieves 7 to 56 percent higher success rates and reduces collisions by 3 to 31 percent compared with baselines. Our approach demonstrates robust performance with smooth whole-body coordination and strong generalization to unseen tasks, paving the way for reliable mobile manipulation in unstructured environments. Details of the demo and supplemental material are available on our project website: https://sites.google.com/view/m4diffuser.
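The abstract does not spell out the ReM-QP formulation (including how slack variables are eliminated), so the following is only a minimal, hypothetical sketch of what a manipulability-aware QP velocity controller can look like: it tracks a commanded end-effector twist while biasing the solution toward configurations with higher Yoshikawa manipulability w(q) = sqrt(det(J J^T)). The function names, the cvxpy setup, and the specific cost terms are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a manipulability-aware QP velocity step, NOT the paper's
# ReM-QP controller. Assumes a mobile manipulator with a (6 x n) whole-body
# Jacobian covering both base and arm degrees of freedom.
import numpy as np
import cvxpy as cp

def yoshikawa(J: np.ndarray) -> float:
    """Yoshikawa manipulability measure; approaches zero near singularities."""
    return float(np.sqrt(max(np.linalg.det(J @ J.T), 0.0)))

def qp_velocity_step(J, twist_des, grad_w, qd_max, reg=1e-2, k_manip=0.1):
    """Solve for joint velocities tracking a desired end-effector twist.

    J         : (6, n) whole-body Jacobian at the current configuration
    twist_des : (6,)   end-effector twist toward the high-level goal
    grad_w    : (n,)   (numerical) gradient of the manipulability w(q)
    qd_max    : (n,)   joint-velocity limits
    """
    n = J.shape[1]
    qd = cp.Variable(n)
    cost = (cp.sum_squares(J @ qd - twist_des)   # twist tracking
            + reg * cp.sum_squares(qd)           # velocity regularization
            - k_manip * grad_w @ qd)             # prefer rising manipulability
    prob = cp.Problem(cp.Minimize(cost), [cp.abs(qd) <= qd_max])
    prob.solve()
    return qd.value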
Related papers
- ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation [55.467742403416175]
We introduce a physics-driven neural algorithm that translates large-scale motion capture to humanoid embodiments. We learn a unified multimodal controller that supports both dense references and sparse task specifications. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception.
arXiv Detail & Related papers (2026-03-03T18:59:29Z)
- Ctrl-World: A Controllable Generative World Model for Robot Manipulation [53.71061464925014]
Generalist robot policies can perform a wide range of manipulation skills, but evaluating and improving their ability with unfamiliar objects and instructions remains a significant challenge. World models offer a promising, scalable alternative by enabling policies to roll out within an imagination space.
arXiv Detail & Related papers (2025-10-11T09:13:10Z)
- Grasp-MPC: Closed-Loop Visual Grasping via Value-Guided Model Predictive Control [24.588260602136867]
We propose Grasp-MPC, a closed-loop vision-based grasping policy for novel objects in cluttered environments. Grasp-MPC incorporates a value function trained on visual observations from a large-scale synthetic dataset of 2 million grasp trajectories. We evaluate Grasp-MPC on FetchBench and in real-world settings across diverse environments.
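As a generic illustration of the value-guided MPC pattern this summary describes (not Grasp-MPC's actual method), a minimal sampling-based sketch is shown below: candidate action sequences are rolled out through a dynamics model, terminal states are ranked by a learned value function, and only the first action of the best sequence is executed. `dynamics`, `value_fn`, and all hyperparameters are hypothetical placeholders.

```python
# Generic value-guided sampling MPC sketch (receding horizon), for
# illustration only; Grasp-MPC's actual formulation is in the paper.
import numpy as np

def value_guided_mpc(state, dynamics, value_fn, act_dim,
                     horizon=10, n_samples=256, act_scale=0.05, rng=None):
    rng = rng or np.random.default_rng()
    # Candidate action sequences: (n_samples, horizon, act_dim)
    candidates = act_scale * rng.standard_normal((n_samples, horizon, act_dim))
    scores = np.empty(n_samples)
    for i, seq in enumerate(candidates):
        s = state
        for a in seq:                 # open-loop rollout of one candidate
            s = dynamics(s, a)
        scores[i] = value_fn(s)       # learned value of the terminal state
    best = candidates[np.argmax(scores)]
    return best[0]                    # execute only the first action
```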
arXiv Detail & Related papers (2025-09-07T20:28:21Z)
- OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis [70.39500621448383]
The open-world mobile manipulation task remains challenging due to the need to generalize to open-ended instructions and environments. We propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot via function calling. We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators, unifying global scene understanding, robot state tracking, and multi-modal action generation in a single model.
arXiv Detail & Related papers (2025-06-04T17:57:44Z)
- Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach [82.27842884709378]
We propose a framework that prioritizes natural language understanding and structured reasoning to enhance the agent's global understanding of the environment. Our method outperforms previous approaches, notably achieving a 44.4% relative improvement in task success rate.
arXiv Detail & Related papers (2025-05-22T09:08:47Z)
- Meta-Control: Automatic Model-based Control Synthesis for Heterogeneous Robot Skills [10.43221469116584]
We propose Meta-Control, which creates customized state representations and control strategies tailored to specific tasks. Our core insight is that a meta-control system can be built to automate the thought process that human experts use to design control systems.
arXiv Detail & Related papers (2024-05-18T19:58:44Z)
- Learning Model Predictive Controllers with Real-Time Attention for Real-World Navigation [34.86856430694435]
We present a new class of implicit control policies combining the benefits of imitation learning with the robust handling of system constraints.
Our approach, called Performer-MPC, uses a learned cost function parameterized by vision context embeddings provided by Performers.
Compared with a standard MPC policy, Performer-MPC achieves a >40% higher goal-reaching rate in cluttered environments and performs >65% better on social metrics when navigating around humans.
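As a loose sketch of the "learned cost parameterized by context embeddings" idea (not Performer-MPC's actual architecture), an MPC's cost can be made scene-dependent by letting a network map a visual embedding to cost weights; `weight_net` and the quadratic form below are assumptions for illustration.

```python
# Illustrative context-conditioned MPC cost, not Performer-MPC itself: the
# same planner adapts per scene because the cost weights come from a network
# conditioned on a visual context embedding.
import numpy as np

def learned_cost(traj, goal, embed, weight_net):
    """Cost of a candidate trajectory under context-dependent weights.

    traj       : (T, d) candidate state trajectory
    goal       : (d,)   navigation goal
    embed      : (k,)   vision context embedding
    weight_net : maps an embedding to per-dimension cost weights (d,)
    """
    w = weight_net(embed)                          # context-dependent weights
    return float(np.sum(w * (traj - goal) ** 2))   # weighted tracking cost
```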
arXiv Detail & Related papers (2022-09-22T04:57:58Z)
- Nonprehensile Riemannian Motion Predictive Control [57.295751294224765]
We introduce a novel Real-to-Sim reward analysis technique to reliably imagine and predict the outcome of taking possible actions for a real robotic platform.
We produce a closed-loop controller to reactively push objects in a continuous action space.
We observe that RMPC is robust in cluttered as well as occluded environments and outperforms the baselines.
arXiv Detail & Related papers (2021-11-15T18:50:04Z)
- Articulated Object Interaction in Unknown Scenes with Whole-Body Mobile Manipulation [16.79185733369416]
We propose a two-stage architecture for autonomous interaction with large articulated objects in unknown environments.
The first stage uses a learned model to estimate the articulated model of a target object from an RGB-D input and predicts an action-conditional sequence of states for interaction.
The second stage consists of a whole-body motion controller that manipulates the object along the generated kinematic plan.
arXiv Detail & Related papers (2021-03-18T21:32:18Z)
- Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives [89.34229413345541]
We propose a conditioning scheme which avoids pitfalls by learning the controller and its conditioning in an end-to-end manner.
Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion.
We report significant improvements in task success over representative MPC and IL baselines.
arXiv Detail & Related papers (2020-03-19T15:04:37Z)