Multi-Modal Manipulation via Multi-Modal Policy Consensus
- URL: http://arxiv.org/abs/2509.23468v1
- Date: Sat, 27 Sep 2025 19:43:04 GMT
- Title: Multi-Modal Manipulation via Multi-Modal Policy Consensus
- Authors: Haonan Chen, Jiaming Xu, Hongyu Chen, Kaiwen Hong, Binghao Huang, Chaoqi Liu, Jiayuan Mao, Yunzhu Li, Yilun Du, Katherine Driggs-Campbell
- Abstract summary: We propose a new approach for integrating diverse sensory modalities in robotic manipulation. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion.
- Score: 62.49978559936122
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental incorporation of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines in scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption. We also conduct a perturbation-based importance analysis, which reveals adaptive shifts between modalities.
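To make the factorized-policy idea concrete, here is a minimal PyTorch-style sketch: one diffusion denoiser per modality plus a router that outputs softmax consensus weights used to mix the per-modality noise predictions. All class names, dimensions, and MLP architectures below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code) of a factorized multi-modal policy:
# one diffusion denoiser per modality, a router producing consensus weights.
import torch
import torch.nn as nn

class ModalityDenoiser(nn.Module):
    """Predicts diffusion noise for an action from ONE modality's features."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, noisy_act, t):
        # t: normalized diffusion timestep, shape (B, 1).
        return self.net(torch.cat([obs, noisy_act, t], dim=-1))

class Router(nn.Module):
    """Maps all modality features to one softmax weight per modality."""
    def __init__(self, obs_dims, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sum(obs_dims), hidden), nn.ReLU(),
            nn.Linear(hidden, len(obs_dims)),
        )

    def forward(self, obs_list):
        return torch.softmax(self.net(torch.cat(obs_list, dim=-1)), dim=-1)

class ConsensusPolicy(nn.Module):
    def __init__(self, obs_dims, act_dim):
        super().__init__()
        self.experts = nn.ModuleList(
            [ModalityDenoiser(d, act_dim) for d in obs_dims])
        self.router = Router(obs_dims)

    def forward(self, obs_list, noisy_act, t):
        w = self.router(obs_list)                      # (B, M) consensus weights
        eps = torch.stack(
            [e(o, noisy_act, t) for e, o in zip(self.experts, obs_list)],
            dim=1)                                     # (B, M, act_dim)
        return (w.unsqueeze(-1) * eps).sum(dim=1)      # weighted combination

# Example: vision (512-d) and touch (64-d) features, 7-DoF action.
policy = ConsensusPolicy(obs_dims=[512, 64], act_dim=7)
obs = [torch.randn(8, 512), torch.randn(8, 64)]
eps_hat = policy(obs, torch.randn(8, 7), torch.rand(8, 1))
print(eps_hat.shape)  # torch.Size([8, 7])
```

Because each expert sees only its own modality, a new representation can in principle be added by training one more denoiser and widening the router, which matches the incremental extensibility the abstract describes.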
Related papers
- Flexible Multitask Learning with Factorized Diffusion Policy [59.526246520933135]
Multitask learning poses significant challenges due to the highly multimodal and diverse nature of robot action distributions. Existing monolithic models often underfit the action distribution and lack the flexibility required for efficient adaptation. We introduce a novel modular diffusion policy framework that factorizes complex action distributions into a composition of specialized diffusion models.
arXiv Detail & Related papers (2025-12-26T07:11:47Z)
- TouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception [8.939880394166348]
We propose a robust multimodal fusion framework, TouchFormer. We employ a Modality-Adaptive Gating mechanism and intra- and inter-modality attention mechanisms to adaptively integrate cross-modal features. We show that TouchFormer achieves classification accuracy improvements of 2.48% and 6.83% on SSMC and subcategory tasks, respectively (a minimal gating sketch follows this entry).
arXiv Detail & Related papers (2025-11-24T00:43:59Z)
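A minimal sketch of modality-adaptive gating in the spirit of the TouchFormer summary above: each modality's tokens are scaled by a learned reliability gate before attention-based fusion. The gate design and all names here are assumptions; the paper's actual architecture is not reproduced.

```python
# Hedged illustration of a modality-adaptive gate; all names are hypothetical.
import torch
import torch.nn as nn

class ModalityAdaptiveGate(nn.Module):
    """Scales each modality's token features by a learned (0,1) gate."""
    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        # One tiny scorer per modality; sigmoid keeps gates in (0, 1).
        self.scorers = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
             for _ in range(num_modalities)])

    def forward(self, feats):
        # feats: list of (B, T, dim) token tensors, one per modality.
        gated = []
        for f, scorer in zip(feats, self.scorers):
            g = scorer(f.mean(dim=1, keepdim=True))  # (B, 1, 1) gate per modality
            gated.append(g * f)
        return torch.cat(gated, dim=1)  # gated tokens, ready for attention

gate = ModalityAdaptiveGate(dim=128, num_modalities=2)
tokens = [torch.randn(4, 16, 128), torch.randn(4, 16, 128)]
print(gate(tokens).shape)  # torch.Size([4, 32, 128])
```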
- ImaginationPolicy: Towards Generalizable, Precise and Reliable End-to-End Policy for Robotic Manipulation [46.06124092071133]
We propose a novel Chain of Moving Oriented Keypoints (CoMOK) formulation for robotic manipulation. Our formulation is used as the action representation of a neural policy, which can be trained in an end-to-end fashion.
arXiv Detail & Related papers (2025-09-25T07:29:07Z)
- HeLoFusion: An Efficient and Scalable Encoder for Modeling Heterogeneous and Multi-Scale Interactions in Trajectory Prediction [11.30785902722196]
HeLoFusion is an efficient and scalable encoder for modeling heterogeneous and multi-scale agent interactions. Our work demonstrates that a locality-grounded architecture, which explicitly models multi-scale and heterogeneous interactions, is a highly effective strategy for advancing motion forecasting.
arXiv Detail & Related papers (2025-09-15T09:19:41Z)
- Deformable Cluster Manipulation via Whole-Arm Policy Learning [27.54191389134963]
We propose a novel framework for learning model-free policies that integrates two modalities: 3D point clouds and proprioceptive touch indicators. Our reinforcement learning framework leverages a distributional state representation, aided by kernel mean embeddings, to achieve improved training efficiency and real-time inference. We deploy the framework in a power line clearance scenario and observe that the agent generates creative strategies leveraging multiple arm links for de-occlusion (a kernel-mean-embedding sketch follows this entry).
arXiv Detail & Related papers (2025-07-22T23:58:30Z)
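The kernel mean embedding mentioned above can be illustrated with random Fourier features: a variable-size point cloud is summarized as the average of per-point feature maps, yielding a fixed-size distributional state. This is a generic textbook construction, not the paper's code; the bandwidth and feature count below are arbitrary choices.

```python
# Hedged illustration of a kernel mean embedding via random Fourier features.
import numpy as np

def random_fourier_features(points, n_features=256, bandwidth=1.0, seed=0):
    """Approximate RBF-kernel feature map phi(x) for each point."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(points @ W + b)

def kernel_mean_embedding(points, **kw):
    """Mean of per-point feature maps = fixed-size embedding of the set."""
    return random_fourier_features(points, **kw).mean(axis=0)

cloud = np.random.rand(1024, 3)        # e.g., a 3D point cloud observation
state = kernel_mean_embedding(cloud)   # fixed-size regardless of cloud size
print(state.shape)                     # (256,)
```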
- Zero-Shot Visual Generalization in Robot Manipulation [0.13280779791485384]
Current approaches often sidestep the problem by relying on invariant representations such as point clouds and depth. Disentangled representation learning has recently shown promise in enabling vision-based reinforcement learning policies to be robust to visual distribution shifts. We demonstrate zero-shot adaptability to visual perturbations in both simulation and on real hardware.
arXiv Detail & Related papers (2025-05-16T22:01:46Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout. DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- Decentralized Learning Strategies for Estimation Error Minimization with Graph Neural Networks [86.99017195607077]
We address the challenge of sampling and remote estimation for autoregressive Markovian processes in a wireless network with statistically-identical agents. Our goal is to minimize time-average estimation error and/or age of information with decentralized scalable sampling and transmission policies.
arXiv Detail & Related papers (2024-04-04T06:24:11Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality (a minimal cross-attention sketch follows this entry).
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
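A hedged reading of the implicit manipulation query (IMQ) described above: a small set of learnable query tokens cross-attends to one modality's features to aggregate global contextual cues. The sketch assumes standard multi-head attention; names and dimensions are illustrative only.

```python
# Hedged sketch of learnable queries pooling context from modality tokens.
import torch
import torch.nn as nn

class ImplicitQueryPool(nn.Module):
    def __init__(self, dim: int, num_queries: int = 4, heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, T, dim) features from a single modality.
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # queries attend to tokens
        return pooled  # (B, num_queries, dim) aggregated contextual cues

pool = ImplicitQueryPool(dim=128)
print(pool(torch.randn(2, 50, 128)).shape)  # torch.Size([2, 4, 128])
```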
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- SFusion: Self-attention based N-to-One Multimodal Fusion Block [6.059397373352718]
We propose a self-attention based fusion block called SFusion.
It learns to fuse available modalities without synthesizing or zero-padding missing ones.
In this work, we apply SFusion to different backbone networks for human activity recognition and brain tumor segmentation tasks (a minimal fusion sketch follows this entry).
arXiv Detail & Related papers (2022-08-26T16:42:14Z)
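As a rough illustration of N-to-one fusion without zero-padding, the sketch below stacks whichever modality features are available into a token sequence, applies self-attention, and averages; the list length may vary per call. This follows only the abstract's description, not the released SFusion code.

```python
# Hedged sketch of fusing a variable number of available modalities.
import torch
import torch.nn as nn

class AvailableModalityFusion(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):
        # feats: list of (B, dim) vectors, one per AVAILABLE modality;
        # missing modalities are simply absent, never zero-padded.
        tokens = torch.stack(feats, dim=1)           # (B, N_available, dim)
        fused, _ = self.attn(tokens, tokens, tokens) # self-attention mixes them
        return fused.mean(dim=1)                     # (B, dim) fused feature

fusion = AvailableModalityFusion(dim=64)
two = fusion([torch.randn(2, 64), torch.randn(2, 64)])  # two modalities
three = fusion([torch.randn(2, 64)] * 3)                # three modalities
print(two.shape, three.shape)  # torch.Size([2, 64]) torch.Size([2, 64])
```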
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.