Multi-Modal Manipulation via Multi-Modal Policy Consensus
- URL: http://arxiv.org/abs/2509.23468v1
- Date: Sat, 27 Sep 2025 19:43:04 GMT
- Title: Multi-Modal Manipulation via Multi-Modal Policy Consensus
- Authors: Haonan Chen, Jiaming Xu, Hongyu Chen, Kaiwen Hong, Binghao Huang, Chaoqi Liu, Jiayuan Mao, Yunzhu Li, Yilun Du, Katherine Driggs-Campbell
- Abstract summary: We propose a new approach for integrating diverse sensory modalities in robotic manipulation. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion.
- Score: 62.49978559936122
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental incorporation of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines in scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption. We also conduct a perturbation-based importance analysis, which reveals adaptive shifts between modalities.
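To make the factorized-policy idea concrete, here is a minimal PyTorch-style sketch: one diffusion denoiser per modality plus a router that outputs softmax consensus weights used to mix the per-modality noise predictions. All class names, dimensions, and MLP architectures below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code) of a factorized multi-modal policy:
# one diffusion denoiser per modality, a router producing consensus weights.
import torch
import torch.nn as nn

class ModalityDenoiser(nn.Module):
    """Predicts diffusion noise for an action from ONE modality's features."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, noisy_act, t):
        # t: normalized diffusion timestep, shape (B, 1).
        return self.net(torch.cat([obs, noisy_act, t], dim=-1))

class Router(nn.Module):
    """Maps all modality features to one softmax weight per modality."""
    def __init__(self, obs_dims, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sum(obs_dims), hidden), nn.ReLU(),
            nn.Linear(hidden, len(obs_dims)),
        )

    def forward(self, obs_list):
        return torch.softmax(self.net(torch.cat(obs_list, dim=-1)), dim=-1)

class ConsensusPolicy(nn.Module):
    def __init__(self, obs_dims, act_dim):
        super().__init__()
        self.experts = nn.ModuleList(
            [ModalityDenoiser(d, act_dim) for d in obs_dims])
        self.router = Router(obs_dims)

    def forward(self, obs_list, noisy_act, t):
        w = self.router(obs_list)                      # (B, M) consensus weights
        eps = torch.stack(
            [e(o, noisy_act, t) for e, o in zip(self.experts, obs_list)],
            dim=1)                                     # (B, M, act_dim)
        return (w.unsqueeze(-1) * eps).sum(dim=1)      # weighted combination

# Example: vision (512-d) and touch (64-d) features, 7-DoF action.
policy = ConsensusPolicy(obs_dims=[512, 64], act_dim=7)
obs = [torch.randn(8, 512), torch.randn(8, 64)]
eps_hat = policy(obs, torch.randn(8, 7), torch.rand(8, 1))
print(eps_hat.shape)  # torch.Size([8, 7])
```

Because each expert sees only its own modality, a new representation can in principle be added by training one more denoiser and widening the router, which matches the incremental extensibility the abstract describes.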
Related papers
- Flexible Multitask Learning with Factorized Diffusion Policy [59.526246520933135]
Multitask learning poses significant challenges due to the highly multimodal and diverse nature of robot action distributions. Existing monolithic models often underfit the action distribution and lack the flexibility required for efficient adaptation. We introduce a novel modular diffusion policy framework that factorizes complex action distributions into a composition of specialized diffusion models.
arXiv Detail & Related papers (2025-12-26T07:11:47Z)
- TouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception [8.939880394166348]
We propose a robust multimodal fusion framework, TouchFormer. We employ a Modality-Adaptive Gating mechanism and intra- and inter-modality attention mechanisms to adaptively integrate cross-modal features. We show that TouchFormer achieves classification accuracy improvements of 2.48% and 6.83% on SSMC and subcategory tasks, respectively (a minimal gating sketch follows this entry).
arXiv Detail & Related papers (2025-11-24T00:43:59Z)
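A minimal sketch of modality-adaptive gating in the spirit of the TouchFormer summary above: each modality's tokens are scaled by a learned reliability gate before attention-based fusion. The gate design and all names here are assumptions; the paper's actual architecture is not reproduced.

```python
# Hedged illustration of a modality-adaptive gate; all names are hypothetical.
import torch
import torch.nn as nn

class ModalityAdaptiveGate(nn.Module):
    """Scales each modality's token features by a learned (0,1) gate."""
    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        # One tiny scorer per modality; sigmoid keeps gates in (0, 1).
        self.scorers = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
             for _ in range(num_modalities)])

    def forward(self, feats):
        # feats: list of (B, T, dim) token tensors, one per modality.
        gated = []
        for f, scorer in zip(feats, self.scorers):
            g = scorer(f.mean(dim=1, keepdim=True))  # (B, 1, 1) gate per modality
            gated.append(g * f)
        return torch.cat(gated, dim=1)  # gated tokens, ready for attention

gate = ModalityAdaptiveGate(dim=128, num_modalities=2)
tokens = [torch.randn(4, 16, 128), torch.randn(4, 16, 128)]
print(gate(tokens).shape)  # torch.Size([4, 32, 128])
```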
- ImaginationPolicy: Towards Generalizable, Precise and Reliable End-to-End Policy for Robotic Manipulation [46.06124092071133]
We propose a novel Chain of Moving Oriented Keypoints (CoMOK) formulation for robotic manipulation. Our formulation is used as the action representation of a neural policy, which can be trained in an end-to-end fashion.
arXiv Detail & Related papers (2025-09-25T07:29:07Z)
- HeLoFusion: An Efficient and Scalable Encoder for Modeling Heterogeneous and Multi-Scale Interactions in Trajectory Prediction [11.30785902722196]
HeLoFusion is an efficient and scalable encoder for modeling heterogeneous and multi-scale agent interactions. Our work demonstrates that a locality-grounded architecture, which explicitly models multi-scale and heterogeneous interactions, is a highly effective strategy for advancing motion forecasting.
arXiv Detail & Related papers (2025-09-15T09:19:41Z)
- Deformable Cluster Manipulation via Whole-Arm Policy Learning [27.54191389134963]
We propose a novel framework for learning model-free policies that integrates two modalities: 3D point clouds and proprioceptive touch indicators. Our reinforcement learning framework leverages a distributional state representation, aided by kernel mean embeddings, to achieve improved training efficiency and real-time inference. We deploy the framework in a power line clearance scenario and observe that the agent generates creative strategies leveraging multiple arm links for de-occlusion (a kernel-mean-embedding sketch follows this entry).
arXiv Detail & Related papers (2025-07-22T23:58:30Z)
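The kernel mean embedding mentioned above can be illustrated with random Fourier features: a variable-size point cloud is summarized as the average of per-point feature maps, yielding a fixed-size distributional state. This is a generic textbook construction, not the paper's code; the bandwidth and feature count below are arbitrary choices.

```python
# Hedged illustration of a kernel mean embedding via random Fourier features.
import numpy as np

def random_fourier_features(points, n_features=256, bandwidth=1.0, seed=0):
    """Approximate RBF-kernel feature map phi(x) for each point."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(points @ W + b)

def kernel_mean_embedding(points, **kw):
    """Mean of per-point feature maps = fixed-size embedding of the set."""
    return random_fourier_features(points, **kw).mean(axis=0)

cloud = np.random.rand(1024, 3)        # e.g., a 3D point cloud observation
state = kernel_mean_embedding(cloud)   # fixed-size regardless of cloud size
print(state.shape)                     # (256,)
```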
- Zero-Shot Visual Generalization in Robot Manipulation [0.13280779791485384]
Current approaches often sidestep the problem by relying on invariant representations such as point clouds and depth. Disentangled representation learning has recently shown promise in enabling vision-based reinforcement learning policies to be robust to visual distribution shifts. We demonstrate zero-shot adaptability to visual perturbations in both simulation and on real hardware.
arXiv Detail & Related papers (2025-05-16T22:01:46Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout. DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- Decentralized Learning Strategies for Estimation Error Minimization with Graph Neural Networks [86.99017195607077]
We address the challenge of sampling and remote estimation for autoregressive Markovian processes in a wireless network with statistically-identical agents. Our goal is to minimize time-average estimation error and/or age of information with decentralized scalable sampling and transmission policies.
arXiv Detail & Related papers (2024-04-04T06:24:11Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality (a minimal cross-attention sketch follows this entry).
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
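A hedged reading of the implicit manipulation query (IMQ) described above: a small set of learnable query tokens cross-attends to one modality's features to aggregate global contextual cues. The sketch assumes standard multi-head attention; names and dimensions are illustrative only.

```python
# Hedged sketch of learnable queries pooling context from modality tokens.
import torch
import torch.nn as nn

class ImplicitQueryPool(nn.Module):
    def __init__(self, dim: int, num_queries: int = 4, heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, T, dim) features from a single modality.
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # queries attend to tokens
        return pooled  # (B, num_queries, dim) aggregated contextual cues

pool = ImplicitQueryPool(dim=128)
print(pool(torch.randn(2, 50, 128)).shape)  # torch.Size([2, 4, 128])
```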
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- SFusion: Self-attention based N-to-One Multimodal Fusion Block [6.059397373352718]
We propose a self-attention based fusion block called SFusion.
It learns to fuse available modalities without synthesizing or zero-padding missing ones.
In this work, we apply SFusion to different backbone networks for human activity recognition and brain tumor segmentation tasks (a minimal fusion sketch follows this entry).
arXiv Detail & Related papers (2022-08-26T16:42:14Z)
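As a rough illustration of N-to-one fusion without zero-padding, the sketch below stacks whichever modality features are available into a token sequence, applies self-attention, and averages; the list length may vary per call. This follows only the abstract's description, not the released SFusion code.

```python
# Hedged sketch of fusing a variable number of available modalities.
import torch
import torch.nn as nn

class AvailableModalityFusion(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):
        # feats: list of (B, dim) vectors, one per AVAILABLE modality;
        # missing modalities are simply absent, never zero-padded.
        tokens = torch.stack(feats, dim=1)           # (B, N_available, dim)
        fused, _ = self.attn(tokens, tokens, tokens) # self-attention mixes them
        return fused.mean(dim=1)                     # (B, dim) fused feature

fusion = AvailableModalityFusion(dim=64)
two = fusion([torch.randn(2, 64), torch.randn(2, 64)])  # two modalities
three = fusion([torch.randn(2, 64)] * 3)                # three modalities
print(two.shape, three.shape)  # torch.Size([2, 64]) torch.Size([2, 64])
```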
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.