Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning
- URL: http://arxiv.org/abs/2511.14427v1
- Date: Tue, 18 Nov 2025 12:32:23 GMT
- Title: Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning
- Authors: Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki
- Abstract summary: MultiSensory Dynamic Pretraining (MSDP) is a framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings.
- Score: 10.782934021703783
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise, and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control.
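The masked-autoencoding idea in the abstract (reconstruct all sensor streams from only a subset of sensor embeddings) can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the transformer encoder is replaced by mean pooling, the decoders by transposed linear maps, and all modality names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensor streams for one timestep: vision features,
# a force-torque reading, and proprioception (joint positions).
# Dimensions are illustrative, not taken from the paper.
obs = {
    "vision": rng.standard_normal(16),
    "force": rng.standard_normal(6),
    "proprio": rng.standard_normal(7),
}

D = 8  # shared embedding width

# Per-modality linear "tokenizers" standing in for the paper's
# learned sensor embeddings.
proj = {k: rng.standard_normal((v.shape[0], D)) * 0.1 for k, v in obs.items()}
tokens = {k: obs[k] @ proj[k] for k in obs}

def masked_reconstruction_loss(tokens, obs, proj, keep):
    """Reconstruct ALL modalities from only the `keep` subset of
    sensor embeddings, as in masked autoencoding."""
    # Fuse the visible tokens into one context vector (mean pooling
    # stands in for transformer self-attention here).
    context = np.mean([tokens[k] for k in keep], axis=0)
    loss = 0.0
    for k, x in obs.items():
        # Decode each modality from the fused context via the
        # transposed projection (a stand-in decoder).
        recon = context @ proj[k].T
        loss += np.mean((recon - x) ** 2)
    return loss / len(obs)

# Mask out force and proprioception; predict all three streams
# from vision alone, which forces cross-modal prediction.
loss = masked_reconstruction_loss(tokens, obs, proj, keep=["vision"])
print(round(loss, 4))
```

Minimizing such a loss over many masking patterns is what pushes the encoder toward cross-modal prediction and sensor fusion; the actual method trains a transformer end to end.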
Related papers
- D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping [66.22412592525369]
We introduce a real-to-sim-to-real engine that leverages Gaussian Splat representations to build a differentiable engine. We show that our engine achieves accurate and robust performance in mass identification across various object geometries and mass values. Those optimized mass values facilitate force-aware policy learning, achieving superior performance in object grasping.
arXiv Detail & Related papers (2026-03-01T15:32:04Z)
- Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation [21.78866976181311]
See-through-skin (STS) sensors combine tactile and visual perception. Existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction.
arXiv Detail & Related papers (2025-12-10T17:35:13Z)
- Multi-Modal Manipulation via Multi-Modal Policy Consensus [62.49978559936122]
We propose a new approach to integrate diverse sensory modalities for robotic manipulation. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion.
arXiv Detail & Related papers (2025-09-27T19:43:04Z)
- Multimodal Anomaly Detection based on Deep Auto-Encoder for Object Slip Perception of Mobile Manipulation Robots [22.63980025871784]
The proposed framework integrates heterogeneous data streams collected from various robot sensors, including RGB and depth cameras, a microphone, and a force-torque sensor.
The integrated data is used to train a deep autoencoder to construct latent representations of the multisensory data that indicate the normal status.
Anomalies can then be identified by error scores that measure the difference between the trained encoder's latent values for the original input and its latent values for the reconstructed input.
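The latent-space error score described above can be sketched in a few lines of numpy. The encoder and decoder here are fixed random linear maps standing in for a trained deep autoencoder, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for a trained autoencoder: a nonlinear encoder and a
# linear decoder with fixed random weights (illustrative only).
W_enc = rng.standard_normal((12, 4)) * 0.3
W_dec = rng.standard_normal((4, 12)) * 0.3

def encode(x):
    return np.tanh(x @ W_enc)

def decode(z):
    return z @ W_dec

def anomaly_score(x):
    """Error score: distance between the encoder's latent for the
    input and its latent for the reconstructed input."""
    z = encode(x)
    z_recon = encode(decode(z))
    return float(np.linalg.norm(z - z_recon))

# Hypothetical multisensory observations, flattened into one vector.
normal = rng.standard_normal(12) * 0.1
anomalous = rng.standard_normal(12) * 5.0

print(round(anomaly_score(normal), 4), round(anomaly_score(anomalous), 4))
```

In the actual framework the autoencoder is trained on normal-status data, so in-distribution inputs reconstruct well (low score) and slip events produce high scores; a threshold on the score flags the anomaly.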
arXiv Detail & Related papers (2024-03-06T09:15:53Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Nonprehensile Planar Manipulation through Reinforcement Learning with Multimodal Categorical Exploration [8.343657309038285]
Reinforcement Learning is a powerful framework for developing such robot controllers.
We propose a multimodal exploration approach through categorical distributions, which enables us to train planar pushing RL policies.
We show that the learned policies are robust to external disturbances and observation noise, and scale to tasks with multiple pushers.
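Exploration through categorical distributions, as summarized above, can be sketched minimally: each continuous action dimension is discretized into bins and the policy samples from per-dimension categorical distributions over those bins, which (unlike a unimodal Gaussian) can place probability mass on several separated pushing directions at once. Bin counts, logits, and dimensions here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Discretize each action dimension (e.g. planar push dx, dy) into
# 11 bins spanning the action range.
bins = np.linspace(-1.0, 1.0, 11)

def sample_action(logits):
    """Sample one bin per action dimension from independent
    categorical distributions (softmax over per-dimension logits)."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    idx = [rng.choice(len(bins), p=p) for p in probs]
    return bins[idx]

# Bimodal preference in dimension 0: high logits on BOTH extremes,
# something a single Gaussian policy head cannot represent.
logits = np.zeros((2, len(bins)))
logits[0, 0] = 3.0
logits[0, -1] = 3.0

action = sample_action(logits)
print(action.shape)  # -> (2,)
```

Sampling repeatedly from these logits yields pushes near both -1 and +1 in the first dimension, which is the multimodal exploration behavior the summary refers to.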
arXiv Detail & Related papers (2023-08-04T16:55:00Z)
- RObotic MAnipulation Network (ROMAN) -- Hybrid Hierarchical Learning for Solving Complex Sequential Tasks [45.3746654854308]
We present a Hybrid Hierarchical Learning framework, the Robotic Manipulation Network (ROMAN). ROMAN achieves task versatility and robust failure recovery by integrating behavioural cloning, imitation learning, and reinforcement learning. Experimental results show that by orchestrating and activating these specialised manipulation experts, ROMAN generates correct sequential activations for accomplishing long sequences of sophisticated manipulation tasks.
arXiv Detail & Related papers (2023-06-30T20:35:22Z)
- Dexterous Manipulation from Images: Autonomous Real-World RL via Substep Guidance [71.36749876465618]
We describe a system for vision-based dexterous manipulation that provides a "programming-free" approach for users to define new tasks.
Our system includes a framework for users to define a final task and intermediate sub-tasks with image examples.
We present experimental results with a four-finger robotic hand learning multi-stage object manipulation tasks directly in the real world.
arXiv Detail & Related papers (2022-12-19T22:50:40Z)
- Active Predicting Coding: Brain-Inspired Reinforcement Learning for Sparse Reward Robotic Control Problems [79.07468367923619]
We propose a backpropagation-free approach to robotic control through the neuro-cognitive computational framework of neural generative coding (NGC).
We design an agent built completely from powerful predictive coding/processing circuits that facilitate dynamic, online learning from sparse rewards.
We show that our proposed ActPC agent performs well in the face of sparse (extrinsic) reward signals and is competitive with or outperforms several powerful backprop-based RL approaches.
arXiv Detail & Related papers (2022-09-19T16:49:32Z)
- Multi-Robot Collaborative Perception with Graph Neural Networks [6.383576104583731]
We propose a general-purpose Graph Neural Network (GNN) with the main goal of improving performance in multi-robot perception tasks.
We show that the proposed framework can address multi-view visual perception problems such as monocular depth estimation and semantic segmentation.
arXiv Detail & Related papers (2022-01-05T18:47:07Z)
- Transformer-based deep imitation learning for dual-arm robot manipulation [4.717749411286867]
In a dual-arm manipulation setup, the increased number of state dimensions caused by the additional robot manipulators causes distractions. We address this issue using a self-attention mechanism that computes dependencies between elements in a sequential input and focuses on important elements. A Transformer, a variant of self-attention architecture, is applied to deep imitation learning to solve dual-arm manipulation tasks in the real world.
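The self-attention mechanism this summary refers to, scaled dot-product attention, can be written in a few lines of numpy. This is a single unbatched head with random weights; sequence length, feature width, and the interpretation of the elements are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: every sequence element
    attends to every other, letting the model weight the state
    dimensions that matter and down-weight distractors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # Row-wise softmax over attention scores.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

# Toy sequence: 10 state elements (e.g. per-joint features across
# two arms), each a 6-dim vector.
X = rng.standard_normal((10, 6))
Wq, Wk, Wv = (rng.standard_normal((6, 6)) * 0.3 for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # -> (10, 6)
```

In the Transformer used for imitation learning, stacks of such attention layers (with learned weights, multiple heads, and feed-forward sublayers) replace this single random-weight head.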
arXiv Detail & Related papers (2021-08-01T07:42:39Z)
- Cognitive architecture aided by working-memory for self-supervised multi-modal humans recognition [54.749127627191655]
The ability to recognize human partners is an important social skill to build personalized and long-term human-robot interactions.
Deep learning networks have achieved state-of-the-art results and have proven to be suitable tools for this task.
One solution is to make robots learn from their first-hand sensory data with self-supervision.
arXiv Detail & Related papers (2021-03-16T13:50:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.