Spatially Visual Perception for End-to-End Robotic Learning
- URL: http://arxiv.org/abs/2411.17458v1
- Date: Tue, 26 Nov 2024 14:23:42 GMT
- Title: Spatially Visual Perception for End-to-End Robotic Learning
- Authors: Travis Davies, Jiahuan Yan, Xiang Chen, Yu Tian, Yueting Zhuang, Yiqi Huang, Luhui Hu
- Abstract summary: We introduce a video-based spatial perception framework that leverages 3D spatial representations to address environmental variability.
Our approach integrates a novel image augmentation technique, AugBlender, with a state-of-the-art monocular depth estimation model trained on internet-scale data.
- Score: 33.490603706207075
- Abstract: Recent advances in imitation learning have shown significant promise for robotic control and embodied intelligence. However, achieving robust generalization across diverse mounted camera observations remains a critical challenge. In this paper, we introduce a video-based spatial perception framework that leverages 3D spatial representations to address environmental variability, with a focus on handling lighting changes. Our approach integrates a novel image augmentation technique, AugBlender, with a state-of-the-art monocular depth estimation model trained on internet-scale data. Together, these components form a cohesive system designed to enhance robustness and adaptability in dynamic scenarios. Our results demonstrate that our approach significantly boosts the success rate across diverse camera exposures, where previous models experience performance collapse. Our findings highlight the potential of video-based spatial perception models in advancing robustness for end-to-end robotic learning, paving the way for scalable, low-cost solutions in embodied intelligence.
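The abstract does not spell out AugBlender's internals, so the sketch below is only an illustration under stated assumptions: an exposure-style blend augmentation standing in for AugBlender, plus any pretrained internet-scale monocular depth estimator, fused into a four-channel RGB-D observation for the downstream imitation policy. The names `exposure_blend`, `SpatialObservation`, and the `depth_model` wrapper are hypothetical, not the authors' released code.
```python
# Illustrative sketch only: an exposure-style blend augmentation (assumed stand-in
# for AugBlender) plus a pretrained monocular depth model, fused into an RGB-D
# observation. Not the authors' implementation.
import torch
import torch.nn.functional as F


def exposure_blend(rgb: torch.Tensor, strength: float = 0.5) -> torch.Tensor:
    """Blend each frame with a randomly re-exposed copy (assumed augmentation).

    rgb: (B, 3, H, W) tensor with values in [0, 1].
    """
    gain = torch.empty(rgb.shape[0], 1, 1, 1, device=rgb.device).uniform_(0.3, 1.7)
    re_exposed = torch.clamp(rgb * gain, 0.0, 1.0)
    alpha = torch.rand_like(gain) * strength          # per-sample blend weight
    return (1.0 - alpha) * rgb + alpha * re_exposed


class SpatialObservation(torch.nn.Module):
    """Stack the augmented RGB frame with predicted depth for the policy."""

    def __init__(self, depth_model: torch.nn.Module):
        super().__init__()
        self.depth_model = depth_model                # any pretrained monocular depth net

    @torch.no_grad()
    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        rgb_aug = exposure_blend(rgb)
        depth = self.depth_model(rgb_aug)             # assumed to return (B, 1, h, w)
        depth = F.interpolate(depth, size=rgb.shape[-2:], mode="bilinear",
                              align_corners=False)
        depth = (depth - depth.amin()) / (depth.amax() - depth.amin() + 1e-6)
        return torch.cat([rgb_aug, depth], dim=1)     # (B, 4, H, W) observation
```
In this reading, the depth branch stays frozen while the augmentation perturbs exposure, so the appended depth channel remains comparatively stable across lighting changes; this is one plausible way the two components could combine, not a claim about the exact architecture.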
Related papers
- Learning Generalizable 3D Manipulation With 10 Demonstrations [16.502781729164973]
We present a novel framework that learns manipulation skills from as few as 10 demonstrations.
We validate our framework through extensive experiments on both simulation benchmarks and real-world robotic systems.
This work shows significant potential for advancing efficient, generalizable manipulation skill learning in real-world applications.
arXiv Detail & Related papers (2024-11-15T14:01:02Z)
- E-Motion: Future Motion Simulation via Event Sequence Diffusion [86.80533612211502]
Event-based sensors may offer a unique opportunity to predict future motion with a level of detail and precision previously unachievable.
We propose to integrate the strong learning capacity of the video diffusion model with the rich motion information of an event camera as a motion simulation framework.
Our findings suggest a promising direction for future research in enhancing the interpretative power and predictive accuracy of computer vision systems.
arXiv Detail & Related papers (2024-10-11T09:19:23Z)
- 3D Hand Mesh Recovery from Monocular RGB in Camera Space [3.0453197258042213]
This study proposes a network model that performs parallel processing of root-relative grids and root recovery tasks.
We utilize an implicit learning approach for 2D heatmaps, enhancing the compatibility of 2D cues across different subtasks.
Our proposed model is comparable with state-of-the-art models.
arXiv Detail & Related papers (2024-05-12T05:36:37Z)
- Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs [57.492124844326206]
This work delves into the task of pose-free novel view synthesis from stereo pairs, a challenging and pioneering task in 3D vision.
Our innovative framework, unlike any before, seamlessly integrates 2D correspondence matching, camera pose estimation, and NeRF rendering, fostering a synergistic enhancement of these tasks.
arXiv Detail & Related papers (2023-12-12T13:22:44Z)
- TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
The video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z)
- Multi-Modal Dataset Acquisition for Photometrically Challenging Object [56.30027922063559]
This paper addresses the limitations of current datasets for 3D vision tasks in terms of accuracy, size, realism, and suitable imaging modalities for photometrically challenging objects.
We propose a novel annotation and acquisition pipeline that enhances existing 3D perception and 6D object pose datasets.
arXiv Detail & Related papers (2023-08-21T10:38:32Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Robust Robotic Control from Pixels using Contrastive Recurrent State-Space Models [8.22669535053079]
We study how to learn world models in unconstrained environments over high-dimensional observation spaces such as images.
One source of difficulty is the presence of irrelevant but hard-to-model background distractions.
We learn a recurrent latent dynamics model that contrastively predicts the next observation; a minimal sketch of this contrastive-prediction idea appears after this list.
This simple model leads to surprisingly robust robotic control even with simultaneous camera, background, and color distractions.
arXiv Detail & Related papers (2021-12-02T12:15:25Z)
- Unadversarial Examples: Designing Objects for Robust Vision [100.4627585672469]
We develop a framework that exploits the sensitivity of modern machine learning algorithms to input perturbations in order to design "robust objects".
We demonstrate the efficacy of the framework on a wide variety of vision-based tasks ranging from standard benchmarks to (in-simulation) robotics.
arXiv Detail & Related papers (2020-12-22T18:26:07Z)
- 3D Scene Geometry-Aware Constraint for Camera Localization with Deep Learning [11.599633757222406]
Recently, end-to-end approaches based on convolutional neural networks have been widely studied, achieving or even exceeding traditional 3D-geometry-based methods.
In this work, we propose a compact network for absolute camera pose regression.
Inspired by those traditional methods, a 3D scene geometry-aware constraint is also introduced by exploiting all available information, including motion, depth, and image content.
arXiv Detail & Related papers (2020-05-13T04:15:14Z)
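As a minimal, non-authoritative sketch of the contrastive recurrent state-space idea from the robust-control-from-pixels entry above: a GRU rolls the latent state forward from the previous latent and action, and an InfoNCE-style loss asks the predicted next latent to score highest against the embedding of the true next observation, with other batch items serving as negatives. The layer sizes, the MLP encoder, and the bilinear critic `W` are assumptions for illustration, not the paper's code.
```python
# Sketch of a recurrent latent dynamics model trained with a contrastive
# (InfoNCE-style) next-observation prediction loss. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveRSSM(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.rnn = nn.GRUCell(latent_dim + act_dim, latent_dim)
        self.predictor = nn.Linear(latent_dim, latent_dim)  # predicts next latent
        self.W = nn.Parameter(torch.eye(latent_dim))         # bilinear critic

    def step(self, obs, act, h):
        """Embed the current frame and advance the recurrent state."""
        z = self.encoder(obs)
        h = self.rnn(torch.cat([z, act], dim=-1), h)          # recurrent update
        return self.predictor(h), h                            # predicted next latent

    def contrastive_loss(self, pred, next_obs):
        """Score the true next embedding above in-batch negatives (InfoNCE)."""
        target = self.encoder(next_obs)                        # (B, D) positives
        logits = pred @ self.W @ target.t()                    # (B, B) similarities
        labels = torch.arange(pred.shape[0], device=pred.device)
        return F.cross_entropy(logits, labels)


# Usage with hypothetical shapes: a batch of flattened observations and actions.
model = ContrastiveRSSM(obs_dim=64, act_dim=4)
obs, act, next_obs = torch.randn(32, 64), torch.randn(32, 4), torch.randn(32, 64)
h = torch.zeros(32, 128)
pred, h = model.step(obs, act, h)
loss = model.contrastive_loss(pred, next_obs)
```
Because the objective only needs to rank the true next embedding above in-batch negatives, the latent can discard pixel-level background detail, which is roughly the intuition behind the reported robustness to camera, background, and color distractions.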
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.