Spatially Visual Perception for End-to-End Robotic Learning
- URL: http://arxiv.org/abs/2411.17458v1
- Date: Tue, 26 Nov 2024 14:23:42 GMT
- Title: Spatially Visual Perception for End-to-End Robotic Learning
- Authors: Travis Davies, Jiahuan Yan, Xiang Chen, Yu Tian, Yueting Zhuang, Yiqi Huang, Luhui Hu,
- Abstract summary: We introduce a video-based spatial perception framework that leverages 3D spatial representations to address environmental variability.
Our approach integrates a novel image augmentation technique, AugBlender, with a state-of-the-art monocular depth estimation model trained on internet-scale data.
- Score: 33.490603706207075
- License:
- Abstract: Recent advances in imitation learning have shown significant promise for robotic control and embodied intelligence. However, achieving robust generalization across diverse mounted camera observations remains a critical challenge. In this paper, we introduce a video-based spatial perception framework that leverages 3D spatial representations to address environmental variability, with a focus on handling lighting changes. Our approach integrates a novel image augmentation technique, AugBlender, with a state-of-the-art monocular depth estimation model trained on internet-scale data. Together, these components form a cohesive system designed to enhance robustness and adaptability in dynamic scenarios. Our results demonstrate that our approach significantly boosts the success rate across diverse camera exposures, where previous models experience performance collapse. Our findings highlight the potential of video-based spatial perception models in advancing robustness for end-to-end robotic learning, paving the way for scalable, low-cost solutions in embodied intelligence.
Related papers
- Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators.
To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module.
Experiments demonstrate that DWS can be versatilely applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z) - Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles [81.29018359825872]
This paper consolidates a set of good practices to finetune large pretrained models for a real-world task.
Specifically, we develop several strategies to account for discrepancies between the synthetic data and real driving data.
Our insights lead to effective finetuning that results in a $68.8%$ reduction in FID for novel view synthesis over prior arts.
arXiv Detail & Related papers (2024-12-19T03:39:13Z) - Learning Generalizable 3D Manipulation With 10 Demonstrations [16.502781729164973]
We present a novel framework that learns manipulation skills from as few as 10 demonstrations.
We validate our framework through extensive experiments on both simulation benchmarks and real-world robotic systems.
This work shows significant potential for advancing efficient, generalizable manipulation skill learning in real-world applications.
arXiv Detail & Related papers (2024-11-15T14:01:02Z) - E-Motion: Future Motion Simulation via Event Sequence Diffusion [86.80533612211502]
Event-based sensors may potentially offer a unique opportunity to predict future motion with a level of detail and precision previously unachievable.
We propose to integrate the strong learning capacity of the video diffusion model with the rich motion information of an event camera as a motion simulation framework.
Our findings suggest a promising direction for future research in enhancing the interpretative power and predictive accuracy of computer vision systems.
arXiv Detail & Related papers (2024-10-11T09:19:23Z) - 3D Hand Mesh Recovery from Monocular RGB in Camera Space [3.0453197258042213]
This study proposes a network model that performs parallel processing of root-relative grids and root recovery tasks.
We utilize an implicit learning approach for 2D heatmaps, enhancing the compatibility of 2D cues across different subtasks.
Our proposed model is comparable with state-of-the-art models.
arXiv Detail & Related papers (2024-05-12T05:36:37Z) - Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs [57.492124844326206]
This work delves into the task of pose-free novel view synthesis from stereo pairs, a challenging and pioneering task in 3D vision.
Our innovative framework, unlike any before, seamlessly integrates 2D correspondence matching, camera pose estimation, and NeRF rendering, fostering a synergistic enhancement of these tasks.
arXiv Detail & Related papers (2023-12-12T13:22:44Z) - Multi-Modal Dataset Acquisition for Photometrically Challenging Object [56.30027922063559]
This paper addresses the limitations of current datasets for 3D vision tasks in terms of accuracy, size, realism, and suitable imaging modalities for photometrically challenging objects.
We propose a novel annotation and acquisition pipeline that enhances existing 3D perception and 6D object pose datasets.
arXiv Detail & Related papers (2023-08-21T10:38:32Z) - Robust Robotic Control from Pixels using Contrastive Recurrent
State-Space Models [8.22669535053079]
We study how to learn world models in unconstrained environments over high-dimensional observation spaces such as images.
One source of difficulty is the presence of irrelevant but hard-to-model background distractions.
We learn a recurrent latent dynamics model which contrastively predicts the next observation.
This simple model leads to surprisingly robust robotic control even with simultaneous camera, background, and color distractions.
arXiv Detail & Related papers (2021-12-02T12:15:25Z) - Unadversarial Examples: Designing Objects for Robust Vision [100.4627585672469]
We develop a framework that exploits the sensitivity of modern machine learning algorithms to input perturbations in order to design "robust objects"
We demonstrate the efficacy of the framework on a wide variety of vision-based tasks ranging from standard benchmarks to (in-simulation) robotics.
arXiv Detail & Related papers (2020-12-22T18:26:07Z) - 3D Scene Geometry-Aware Constraint for Camera Localization with Deep
Learning [11.599633757222406]
Recently end-to-end approaches based on convolutional neural network have been much studied to achieve or even exceed 3D-geometry based traditional methods.
In this work, we propose a compact network for absolute camera pose regression.
Inspired from those traditional methods, a 3D scene geometry-aware constraint is also introduced by exploiting all available information including motion, depth and image contents.
arXiv Detail & Related papers (2020-05-13T04:15:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.