Related papers: AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models

AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models

URL: http://arxiv.org/abs/2602.10698v1
Date: Wed, 11 Feb 2026 09:57:32 GMT
Title: AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models
Authors: Zhifeng Rao, Wenlong Chen, Lei Xie, Xia Hua, Dongfu Yin, Zhen Tian, F. Richard Yu,
Abstract summary: Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control.<n>We propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations.
Score: 42.57469056850227
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches primarily rely on VLM trained using 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ a depth estimation baseline called VGGT to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient utilization of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module called action assistant, which constrains the learned 3D representations with action priors and ensures their consistency with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization ability and robustness of VLA models. Experimental results demonstrate that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also leads to superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.

Related papers

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding.<n>We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLM.<n>Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering [50.675710727721786]
We propose GauSSmart, a hybrid method that bridges 2D foundational models and 3D Gaussian Splatting reconstruction.<n>Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision.<n>We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting.
arXiv Detail & Related papers (2025-10-16T03:38:26Z)
Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding [11.222744122842023]
We introduce a plug-and-play module that implicitly incorporates 3D geometry features into Vision-Language-Action (VLA) models.<n>Our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
arXiv Detail & Related papers (2025-07-01T04:05:47Z)
Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning.<n>UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z)
GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency [50.11520458252128]
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data.<n>We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models.<n>GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z)
Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance [8.07701188057789]
We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data.<n>Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues.<n>Our method achieves up to 85% of the fully-supervised performance using only 10% labeled data.
arXiv Detail & Related papers (2024-08-21T12:13:18Z)
GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception.<n>Our approach achieves State-Of-The-Art performance on the Occ3D-nuScenes dataset with the least image resolution needed and the most weightless image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z)
Scalable Vision-Based 3D Object Detection and Monocular Depth Estimation for Autonomous Driving [5.347428263669927]
This dissertation is a multifaceted contribution to the advancement of vision-based 3D perception technologies. In the first segment, the thesis introduces structural enhancements to both monocular and stereo 3D object detection algorithms. The second segment is devoted to data-driven strategies and their real-world applications in 3D vision detection.
arXiv Detail & Related papers (2024-03-04T13:42:54Z)
LiDAR-Based 3D Object Detection via Hybrid 2D Semantic Scene Generation [38.38852904444365]
This paper proposes a novel scene representation that encodes both the semantics and geometry of the 3D environment in 2D. Our simple yet effective design can be easily integrated into most state-of-the-art 3D object detectors.
arXiv Detail & Related papers (2023-04-04T04:05:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.