Bridging the Sim2Real Gap: Vision Encoder Pre-Training for Visuomotor Policy Transfer
- URL: http://arxiv.org/abs/2501.16389v2
- Date: Sun, 07 Sep 2025 01:22:45 GMT
- Title: Bridging the Sim2Real Gap: Vision Encoder Pre-Training for Visuomotor Policy Transfer
- Authors: Yash Yardi, Samuel Biruduganti, Lars Ankile,
- Abstract summary: We evaluate the performance of pre-trained vision encoders to address the Sim2Real gap. We show that manipulation-pretrained encoders consistently achieve higher Action Scores.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Simulation offers a scalable and efficient alternative to real-world data collection for learning visuomotor robotic policies. However, the simulation-to-reality, or Sim2Real distribution shift -- introduced by employing simulation-trained policies in real-world environments -- frequently prevents successful policy transfer. We present an offline framework to evaluate the performance of using large-scale pre-trained vision encoders to address the Sim2Real gap. We examine a diverse collection of encoders, assessing their ability to extract features necessary for robot control (Action Score) while remaining invariant to task-irrelevant environmental variations (Domain Invariance Score). Evaluating 23 encoders, we reveal patterns across architectures, pre-training datasets, and parameter scales. Our findings show that manipulation-pretrained encoders consistently achieve higher Action Scores, CNN-based encoders demonstrate stronger domain invariance than ViTs, and the best-performing models combine both properties, underscoring DIS and AS as complementary predictors of Sim2Real transferability.
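The abstract describes two complementary probes over frozen encoder features: an Action Score (can the features predict robot actions?) and a Domain Invariance Score (do the features stay stable under task-irrelevant visual shifts?). The paper's exact definitions are not given here, so the following is only an illustrative sketch under stated assumptions: Action Score approximated as held-out R² of a linear probe from features to actions, and Domain Invariance Score as mean cosine similarity between features of paired sim/real observations.

```python
import numpy as np

rng = np.random.default_rng(0)

def action_score(features, actions):
    """Proxy Action Score: fit a least-squares linear probe from frozen
    encoder features to actions, report R^2 on a held-out split.
    (Illustrative assumption -- not the paper's exact metric.)"""
    n = len(features) // 2
    X_tr, X_te = features[:n], features[n:]
    y_tr, y_te = actions[:n], actions[n:]
    W, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    pred = X_te @ W
    ss_res = np.sum((y_te - pred) ** 2)
    ss_tot = np.sum((y_te - y_te.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

def domain_invariance_score(feats_sim, feats_real):
    """Proxy DIS: mean cosine similarity between features of paired
    sim/real observations; higher = more invariant to the visual shift."""
    a = feats_sim / np.linalg.norm(feats_sim, axis=1, keepdims=True)
    b = feats_real / np.linalg.norm(feats_real, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

# Toy stand-ins for frozen-encoder features and 7-DoF robot actions.
feats = rng.normal(size=(200, 64))
actions = feats @ rng.normal(size=(64, 7)) + 0.1 * rng.normal(size=(200, 7))
feats_real = feats + 0.05 * rng.normal(size=(200, 64))  # mild domain shift

print(round(action_score(feats, actions), 3))
print(round(domain_invariance_score(feats, feats_real), 3))
```

An encoder scoring high on both proxies would, per the abstract's claim, be the better Sim2Real transfer candidate; the two scores are deliberately computed from the same frozen features.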
Related papers
- URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model [76.08429266631823]
We propose an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches.
arXiv Detail & Related papers (2025-11-02T13:45:51Z) - Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training [21.855770200309674]
We propose a unified sim-and-real co-training framework for learning generalizable manipulation policies. We show it can leverage abundant simulation data to achieve up to a 30% improvement in the real-world success rate.
arXiv Detail & Related papers (2025-09-23T04:32:53Z) - Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection [108.5042835056188]
This work introduces Agent4FaceForgery to address two fundamental problems: how to capture the diverse intents and iterative processes of human forgery creation, and how to model the complex, often adversarial, text-image interactions that accompany forgeries in social media.
arXiv Detail & Related papers (2025-09-16T01:05:01Z) - High-Fidelity Digital Twins for Bridging the Sim2Real Gap in LiDAR-Based ITS Perception [3.1508266388327324]
This paper proposes a high-fidelity digital twin (HiFi DT) framework that incorporates real-world background geometry, lane-level road topology, and sensor-specific specifications and placement. Experiments show that the DT-trained model outperforms the equivalent model trained on real data by 4.8%.
arXiv Detail & Related papers (2025-09-03T00:12:58Z) - How to Bridge the Sim-to-Real Gap in Digital Twin-Aided Telecommunication Networks [30.858857240474077]
Training effective artificial intelligence models for telecommunications is challenging due to the scarcity of deployment-specific data. Real data collection is expensive, and available datasets often fail to capture the unique operational conditions and contextual variability of the network environment. Digital twinning provides a potential solution to this problem, as simulators tailored to the current network deployment can generate site-specific data to augment the available training datasets.
arXiv Detail & Related papers (2025-07-09T17:27:51Z) - Sim2Real Transfer for Vision-Based Grasp Verification [7.9471205712560264]
We present a vision-based approach for grasp verification to determine whether the robotic gripper has successfully grasped an object. Our method employs a two-stage architecture: first, a YOLO-based object detection model detects and locates the robot's gripper. To address the limitations of real-world data capture, we introduce HSR-Grasp Synth, a synthetic dataset designed to simulate diverse grasping scenarios.
arXiv Detail & Related papers (2025-05-05T22:04:12Z) - CARLA2Real: a tool for reducing the sim2real gap in CARLA simulator [2.8978140690127328]
We employ a state-of-the-art approach to enhance the photorealism of simulated data, aligning them with the visual characteristics of real-world datasets.
Based on this, we developed CARLA2Real, an easy-to-use, publicly available tool (plug-in) for the widely used and open-source CARLA simulator.
This tool enhances the output of CARLA in near real-time, achieving a frame rate of 13 FPS, translating it to the visual style and realism of real-world datasets.
arXiv Detail & Related papers (2024-10-23T19:33:30Z) - Close the Sim2real Gap via Physically-based Structured Light Synthetic Data Simulation [16.69742672616517]
We introduce an innovative structured light simulation system, generating both RGB and physically realistic depth images.
We create an RGBD dataset tailored for robotic industrial grasping scenarios.
By reducing the sim2real gap and enhancing deep learning training, we facilitate the application of deep learning models in industrial settings.
arXiv Detail & Related papers (2024-07-17T09:57:14Z) - Sim-to-Real Transfer of Deep Reinforcement Learning Agents for Online Coverage Path Planning [15.792914346054502]
We tackle the challenge of sim-to-real transfer of reinforcement learning (RL) agents for coverage path planning (CPP).
We bridge the sim-to-real gap through a semi-virtual environment, including a real robot and real-time aspects, while utilizing a simulated sensor and obstacles.
We find that a high inference frequency allows first-order Markovian policies to transfer directly from simulation, while higher-order policies can be fine-tuned to further reduce the sim-to-real gap.
arXiv Detail & Related papers (2024-06-07T13:24:19Z) - DUSA: Decoupled Unsupervised Sim2Real Adaptation for Vehicle-to-Everything Collaborative Perception [17.595237664316148]
Vehicle-to-Everything (V2X) collaborative perception is crucial for autonomous driving.
Achieving high-precision V2X perception, however, requires a significant amount of annotated real-world data.
We present a new unsupervised sim2real domain adaptation method for V2X collaborative detection named Decoupled Unsupervised Sim2Real Adaptation (DUSA).
arXiv Detail & Related papers (2023-10-12T08:21:17Z) - Robust Visual Sim-to-Real Transfer for Robotic Manipulation [79.66851068682779]
Learning visuomotor policies in simulation is much safer and cheaper than in the real world.
However, due to discrepancies between the simulated and real data, simulator-trained policies often fail when transferred to real robots.
One common approach to bridge the visual sim-to-real domain gap is domain randomization (DR).
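Domain randomization is simple to sketch: randomize task-irrelevant visual factors (brightness, color, texture) of simulated observations during training so the real world looks like just one more variation. The augmentation below is a minimal illustration with assumed ranges, not the cited paper's implementation:

```python
import random

def randomize_sim_image(img, rng=random):
    """Toy domain randomization: perturb global brightness and per-channel
    color of a simulated image (a nested list of RGB tuples) so a policy
    never overfits to one fixed rendering. Illustrative only."""
    gain = rng.uniform(0.7, 1.3)                      # global brightness
    tint = [rng.uniform(0.9, 1.1) for _ in range(3)]  # per-channel color shift
    return [[tuple(min(255, int(c * gain * t)) for c, t in zip(px, tint))
             for px in row] for row in img]

# Stand-in 3x4 simulated render; train on many randomized copies of it.
sim_frame = [[(128, 64, 200)] * 4 for _ in range(3)]
augmented = [randomize_sim_image(sim_frame) for _ in range(8)]
```

In practice the same idea is applied to rendered textures, lighting, and camera pose inside the simulator rather than to the output image alone.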
arXiv Detail & Related papers (2023-07-28T05:47:24Z) - S2R-ViT for Multi-Agent Cooperative Perception: Bridging the Gap from Simulation to Reality [41.25312194294171]
We propose the first Simulation-to-Reality transfer learning framework for multi-agent cooperative perception using a novel Vision Transformer, named S2R-ViT.
Our experiments on the public multi-agent cooperative perception datasets OPV2V and V2V4Real demonstrate that the proposed S2R-ViT can effectively bridge the gap from simulation to reality.
arXiv Detail & Related papers (2023-07-16T03:54:10Z) - Sim2real Transfer Learning for Point Cloud Segmentation: An Industrial Application Case on Autonomous Disassembly [55.41644538483948]
We present an industrial application case that uses sim2real transfer learning for point cloud data.
We provide insights on how to generate and process synthetic point cloud data.
Additionally, a novel patch-based attention network is proposed to tackle this problem.
arXiv Detail & Related papers (2023-01-12T14:00:37Z) - One-Shot Domain Adaptive and Generalizable Semantic Segmentation with Class-Aware Cross-Domain Transformers [96.51828911883456]
Unsupervised sim-to-real domain adaptation (UDA) for semantic segmentation aims to improve the real-world test performance of a model trained on simulated data.
Traditional UDA often assumes that there are abundant unlabeled real-world data samples available during training for the adaptation.
We explore the one-shot unsupervised sim-to-real domain adaptation (OSUDA) and generalization problem, where only one real-world data sample is available.
arXiv Detail & Related papers (2022-12-14T15:54:15Z) - DeXtreme: Transfer of Agile In-hand Manipulation from Simulation to Reality [64.51295032956118]
We train a policy that can perform robust dexterous manipulation on an anthropomorphic robot hand.
Our work reaffirms the possibilities of sim-to-real transfer for dexterous manipulation in diverse kinds of hardware and simulator setups.
arXiv Detail & Related papers (2022-10-25T01:51:36Z) - Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation [98.51313127382937]
We focus on the use of labels in the synthetic domain alone.
Our approach introduces both a way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator.
We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data.
arXiv Detail & Related papers (2021-11-15T18:37:43Z) - Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features by using temporal cues in videos and inherent correlations across modalities for gesture recognition.
Results show that our approach recovers performance with large gains, up to 12.91% in accuracy and 20.16% in F1 score, without using any annotations on the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z) - TrafficSim: Learning to Simulate Realistic Multi-Agent Behaviors [74.67698916175614]
We propose TrafficSim, a multi-agent behavior model for realistic traffic simulation.
In particular, we leverage an implicit latent variable model to parameterize a joint actor policy.
We show TrafficSim generates significantly more realistic and diverse traffic scenarios as compared to a diverse set of baselines.
arXiv Detail & Related papers (2021-01-17T00:29:30Z) - Point Cloud Based Reinforcement Learning for Sim-to-Real and Partial Observability in Visual Navigation [62.22058066456076]
Reinforcement Learning (RL) offers powerful tools for solving complex robotic tasks.
However, policies trained in simulation often do not work directly in the real world, which is known as the sim-to-real transfer problem.
We propose a method that learns on an observation space constructed by point clouds and environment randomization.
arXiv Detail & Related papers (2020-07-27T17:46:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.