Unified Multimodal Diffusion Forcing for Forceful Manipulation
- URL: http://arxiv.org/abs/2511.04812v1
- Date: Thu, 06 Nov 2025 21:08:35 GMT
- Title: Unified Multimodal Diffusion Forcing for Forceful Manipulation
- Authors: Zixuan Huang, Huaidian Hou, Dmitry Berenson
- Abstract summary: We propose Multimodal Diffusion Forcing (MDF), a unified framework for learning from multimodal robot trajectories. Rather than modeling a fixed distribution, MDF applies random partial masking and trains a diffusion model to reconstruct the trajectory. We evaluate MDF on contact-rich, forceful manipulation tasks in simulated and real-world environments.
- Score: 13.51688687815195
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Given a dataset of expert trajectories, standard imitation learning approaches typically learn a direct mapping from observations (e.g., RGB images) to actions. However, such methods often overlook the rich interplay between different modalities, i.e., sensory inputs, actions, and rewards, which is crucial for modeling robot behavior and understanding task outcomes. In this work, we propose Multimodal Diffusion Forcing (MDF), a unified framework for learning from multimodal robot trajectories that extends beyond action generation. Rather than modeling a fixed distribution, MDF applies random partial masking and trains a diffusion model to reconstruct the trajectory. This training objective encourages the model to learn temporal and cross-modal dependencies, such as predicting the effects of actions on force signals or inferring states from partial observations. We evaluate MDF on contact-rich, forceful manipulation tasks in simulated and real-world environments. Our results show that MDF not only delivers versatile functionality but also achieves strong performance and robustness under noisy observations. More visualizations can be found on our website: https://unified-df.github.io
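To make the masked-reconstruction objective concrete, below is a minimal PyTorch sketch of the kind of training step the abstract describes: trajectory tokens from all modalities are randomly selected for corruption, noised to independent levels, and the model is trained to reconstruct them. All names here (MDFDenoiser, the masking ratio, the noise schedule) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MDFDenoiser(nn.Module):
    """Placeholder transformer that denoises multimodal trajectory tokens.

    This is a stand-in for the paper's model, not its actual architecture.
    """
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.noise_emb = nn.Linear(1, dim)  # embeds each token's noise level

    def forward(self, noisy_tokens, noise_level):
        # noisy_tokens: (B, T, dim); noise_level: (B, T) in [0, 1]
        h = noisy_tokens + self.noise_emb(noise_level.unsqueeze(-1))
        return self.backbone(h)  # predicted clean tokens, (B, T, dim)

def mdf_training_step(model, traj, mask_ratio=0.5):
    """One hypothetical MDF step: randomly mask part of the trajectory
    and train the model to reconstruct the clean tokens.

    traj: (B, T, dim) tokens concatenating observations, force signals,
    and actions along the time axis.
    """
    B, T, _ = traj.shape
    # Random partial masking: each token is independently chosen.
    mask = (torch.rand(B, T) < mask_ratio).float()
    # Diffusion-forcing-style independent noise level per token;
    # unmasked tokens keep level 0 and stay clean.
    t = torch.rand(B, T) * mask
    alpha = (1.0 - t).unsqueeze(-1)  # toy linear schedule (assumption)
    noisy = alpha.sqrt() * traj + (1.0 - alpha).sqrt() * torch.randn_like(traj)
    pred = model(noisy, t)
    # Reconstruction loss only on the corrupted tokens.
    return ((pred - traj) ** 2 * mask.unsqueeze(-1)).mean()
```

Because the mask is sampled over all modalities, the same objective can serve action generation (mask future actions), outcome prediction (mask force readings), or state inference (mask observations) at inference time.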
Related papers
- Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model [62.889356203346985]
We propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict. DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
arXiv Detail & Related papers (2025-10-31T16:32:12Z)
- Foundation Model for Skeleton-Based Human Action Understanding [56.89025287217221]
This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT).
arXiv Detail & Related papers (2025-08-18T02:42:16Z)
- 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration [31.111439909825627]
Existing methods typically model the dataset's action distribution using simple observations as inputs. We propose 4D-VLA, a novel approach that effectively integrates 4D information into the input to address these sources of chaos. Our model consistently outperforms existing methods, demonstrating stronger spatial understanding and adaptability.
arXiv Detail & Related papers (2025-06-27T14:09:29Z)
- MinD: Learning A Dual-System World Model for Real-Time Planning and Implicit Risk Analysis [32.08769443927576]
We propose Manipulate in Dream (MinD), a dual-system world model for real-time, risk-aware planning (a minimal sketch of this dual-rate design appears after this list). MinD uses two asynchronous diffusion processes: a low-frequency visual generator (LoDiff) that predicts future scenes and a high-frequency diffusion policy (HiDiff) that outputs actions. Our key insight is that robotic policies do not require fully denoised frames but can rely on low-resolution latents generated in a single denoising step. MinD achieves a 63% success rate on RLBench, 60% on real-world Franka tasks, and operates at 11.3 FPS.
arXiv Detail & Related papers (2025-06-23T17:59:06Z)
- Latent Diffusion Planning for Imitation Learning [78.56207566743154]
Latent Diffusion Planning (LDP) is a modular approach consisting of a planner and an inverse dynamics model. By separating planning from action prediction, LDP can benefit from the denser supervision signals of suboptimal and action-free data. On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches.
arXiv Detail & Related papers (2025-04-23T17:53:34Z)
- Diffusion Imitation from Observation [4.205946699819021]
Adversarial imitation learning approaches learn a generator agent policy that produces state transitions a discriminator cannot distinguish from expert ones.
Motivated by the recent success of diffusion models in generative modeling, we propose to integrate a diffusion model into the adversarial imitation learning from observation framework.
arXiv Detail & Related papers (2024-10-07T18:49:55Z)
- Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception [0.0]
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks.
In this paper, we demonstrate a method of aligning the embedding spaces of different modalities to the vision embedding space.
We show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks.
arXiv Detail & Related papers (2023-08-31T06:53:55Z)
- Value function estimation using conditional diffusion models for control [62.27184818047923]
We propose a simple algorithm called Diffused Value Function (DVF).
It learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model.
We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers.
arXiv Detail & Related papers (2023-06-09T18:40:55Z)
- Meta-models for transfer learning in source localisation [3.8922067105369154]
This work looks to capture the interdependencies between acoustic emission (AE) experiments as meta-models. We utilise a Bayesian multilevel approach where a higher-level meta-model captures the inter-task relationships. A key contribution is how knowledge of the experimental campaign can be encoded between tasks as well as within tasks.
arXiv Detail & Related papers (2023-05-15T14:02:35Z)
- Distilled Mid-Fusion Transformer Networks for Multi-Modal Human Activity Recognition [34.424960016807795]
Multi-modal Human Activity Recognition can utilize complementary information across modalities to build models that generalize well. While deep learning methods have shown promising results, their potential for extracting salient multi-modal spatial-temporal features has not been fully explored.
A knowledge distillation-based Multi-modal Mid-Fusion approach, DMFT, is proposed to conduct informative feature extraction and fusion to resolve the Multi-modal Human Activity Recognition task efficiently.
arXiv Detail & Related papers (2023-05-05T19:26:06Z)
- EnfoMax: Domain Entropy and Mutual Information Maximization for Domain Generalized Face Anti-spoofing [0.0]
Face anti-spoofing (FAS) methods perform well under intra-domain setups.
The domain generalization (DG) method has gained more attention in FAS.
This paper proposes the EnfoMax framework, which uses information theory to analyze cross-domain FAS tasks.
arXiv Detail & Related papers (2023-02-17T03:54:18Z)
- Multi-Branch Deep Radial Basis Function Networks for Facial Emotion Recognition [80.35852245488043]
We propose a CNN based architecture enhanced with multiple branches formed by radial basis function (RBF) units.
RBF units capture local patterns shared by similar instances using an intermediate representation.
We show that it is the incorporation of local information that makes the proposed model competitive.
arXiv Detail & Related papers (2021-09-07T21:05:56Z)
- Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
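As referenced in the MinD entry above, here is a tiny Python sketch of the dual-rate idea that entry describes: a slow world model refreshes a coarse future-scene latent every few control ticks (a single denoising step suffices, per the summary's key insight), while a fast policy issues actions at every tick. The class names, rates, and the `env.observe()`/`env.apply()` interface are hypothetical placeholders, not MinD's actual API.

```python
import torch

class SlowVisualGenerator:
    """Stand-in for a LoDiff-style low-frequency visual world model."""
    def predict_latent(self, obs):
        # One denoising step would yield a low-resolution future-scene
        # latent; a random tensor stands in for it here.
        return torch.randn(64)

class FastPolicy:
    """Stand-in for a HiDiff-style high-frequency diffusion policy."""
    def act(self, obs, latent):
        # A real policy would denoise an action conditioned on obs + latent.
        return torch.zeros(7)  # placeholder 7-DoF action

def control_loop(env, steps=30, slow_every=5):
    """Run the two systems asynchronously: the world model refreshes the
    latent every `slow_every` ticks; the policy acts on every tick."""
    lo, hi = SlowVisualGenerator(), FastPolicy()
    latent = None
    for t in range(steps):
        obs = env.observe()           # hypothetical environment interface
        if t % slow_every == 0:       # low-frequency world-model refresh
            latent = lo.predict_latent(obs)
        env.apply(hi.act(obs, latent))  # high-frequency action output
```

Decoupling the two rates is what lets the policy run in real time while the heavier visual model updates more slowly in the background.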