OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation
- URL: http://arxiv.org/abs/2506.01196v1
- Date: Sun, 01 Jun 2025 22:15:45 GMT
- Title: OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation
- Authors: Ishika Singh, Ankit Goyal, Stan Birchfield, Dieter Fox, Animesh Garg, Valts Blukis
- Abstract summary: 3D-aware policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies.
- Score: 68.11862866566817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies. We address the challenge of mapping natural language instructions and multi-view RGBD observations to quasi-static robot actions. 3D-aware robot policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. On the other hand, VLAs excel at generalizing across instructions and scenes, but can be sensitive to camera and robot pose variations. We leverage prior knowledge embedded in language and vision foundation models to improve generalization of 3D-aware keyframe policies. OG-VLA projects input observations from diverse views into a point cloud which is then rendered from canonical orthographic views, ensuring input view invariance and consistency between input and output spaces. These canonical views are processed with a vision backbone, a Large Language Model (LLM), and an image diffusion model to generate images that encode the next position and orientation of the end-effector on the input scene. Evaluations on the Arnold and Colosseum benchmarks demonstrate state-of-the-art generalization to unseen environments, with over 40% relative improvements while maintaining robust performance in seen settings. We also show real-world adaptation with 3 to 5 demonstrations along with strong generalization. Videos and resources at https://og-vla.github.io/
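The abstract describes a pipeline that lifts multi-view RGBD observations into a shared point cloud and re-renders it from canonical orthographic views before any learning happens. The sketch below is a minimal illustration of that preprocessing idea only, not the authors' implementation: the function names (backproject_rgbd, render_orthographic), the workspace bounds, the image resolution, and the simple per-pixel z-buffer splatting are all assumptions introduced for clarity.

```python
# Illustrative sketch (assumed, not OG-VLA's released code): back-project RGBD
# views into a world-frame point cloud, then render a canonical orthographic view.
import numpy as np

def backproject_rgbd(rgb, depth, K, cam_to_world):
    """Lift one HxW RGBD image into world-frame 3D points with per-point colors."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0                                   # drop pixels with no depth
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]  # 4x4 camera-to-world transform
    colors = rgb.reshape(-1, 3)[valid]
    return pts_world, colors

def render_orthographic(points, colors, axis=2, res=224,
                        bounds=((-0.5, 0.5), (-0.5, 0.5))):
    """Render the point cloud orthographically by dropping one world axis and
    splatting the closest point per pixel (simple z-buffer)."""
    keep = [i for i in range(3) if i != axis]
    img = np.zeros((res, res, 3), dtype=np.uint8)
    zbuf = np.full((res, res), -np.inf)
    (x0, x1), (y0, y1) = bounds
    us = ((points[:, keep[0]] - x0) / (x1 - x0) * (res - 1)).astype(int)
    vs = ((points[:, keep[1]] - y0) / (y1 - y0) * (res - 1)).astype(int)
    height = points[:, axis]                         # depth along the dropped axis
    inside = (us >= 0) & (us < res) & (vs >= 0) & (vs < res)
    for u_, v_, d_, c_ in zip(us[inside], vs[inside], height[inside], colors[inside]):
        if d_ > zbuf[v_, u_]:                        # keep the point nearest the virtual camera
            zbuf[v_, u_] = d_
            img[v_, u_] = c_
    return img
```

Because the rendering is tied to fixed canonical axes rather than to the physical cameras, the resulting images are invariant to the input viewpoints, which is the property the abstract highlights; the downstream vision backbone, LLM, and diffusion model then operate on these canonical images.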
Related papers
- GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond [56.677984098204696]
Multimodal language models are driving the development of 3D Vision-Language Models (VLMs). We propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. We present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images.
arXiv Detail & Related papers (2025-07-01T15:52:59Z) - Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding [8.090058633054852]
We introduce a plug-and-play module that implicitly injects 3D geometry features into Vision-Language-Action (VLA) models. Our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
arXiv Detail & Related papers (2025-07-01T04:05:47Z) - Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects [3.9825600707172986]
We present a comparative study of captioning strategies for tabletop scenes captured by a robotic arm equipped with an RGB camera. The robot collects images of objects from multiple viewpoints, and we evaluate several models that generate scene descriptions. Our experiments examine the trade-offs between single-view and multi-view captioning, and the difference between recognising real-world and 3D-printed objects.
arXiv Detail & Related papers (2025-06-24T12:45:09Z) - 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks [19.026406684039006]
Recent work has demonstrated the capabilities of fine-tuning large Vision-Language Models to learn the mapping between RGB images, language instructions, and joint space control. In this work, we explore methods to improve the scene context awareness of a popular recent Vision-Language-Action model. Our proposed model, 3D-CAVLA, improves the success rate across various LIBERO task suites, achieving an average success rate of 98.1%.
arXiv Detail & Related papers (2025-05-09T05:32:40Z) - CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z) - OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [95.6266030753644]
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs), as visual and language features are independently fed into downstream policies. We propose OTTER, a novel VLA architecture that leverages existing alignments through explicit, text-aware visual feature extraction.
arXiv Detail & Related papers (2025-03-05T18:44:48Z) - PAVLM: Advancing Point Cloud based Affordance Understanding Via Vision-Language Model [4.079327215055764]
Affordance understanding, the task of identifying actionable regions on 3D objects, plays a vital role in allowing robotic systems to engage with and operate within the physical world.
Visual Language Models (VLMs) have excelled in high-level reasoning but fall short in grasping the nuanced physical properties required for effective human-robot interaction.
We introduce PAVLM, an innovative framework that utilizes the extensive multimodal knowledge embedded in pre-trained language models to enhance 3D affordance understanding of point clouds.
arXiv Detail & Related papers (2024-10-15T12:53:42Z) - Multiview Compressive Coding for 3D Reconstruction [77.95706553743626]
We introduce a simple framework that operates on 3D points of single objects or whole scenes.
Our model, Multiview Compressive Coding, learns to compress the input appearance and geometry to predict the 3D structure.
arXiv Detail & Related papers (2023-01-19T18:59:52Z) - 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.