ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models
- URL: http://arxiv.org/abs/2409.15250v1
- Date: Mon, 23 Sep 2024 17:47:59 GMT
- Title: ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models
- Authors: Sombit Dey, Jan-Nico Zaech, Nikolay Nikolov, Luc Van Gool, Danda Pani Paudel
- Abstract summary: We study the visual generalization capabilities of three existing robotic foundation models.
Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios.
We propose a gradual backbone reversal approach founded on model merging.
- Score: 55.07988373824348
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent progress in large language models and access to large-scale robotic datasets have sparked a paradigm shift in robotics, transforming models into generalists able to adapt to various tasks, scenes, and robot modalities. A large step for the community is the release of open Vision-Language-Action models, which showcase strong performance in a wide variety of tasks. In this work, we study the visual generalization capabilities of three existing robotic foundation models and propose a corresponding evaluation framework. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. This is potentially caused by limited variations in the training data and/or catastrophic forgetting, leading to domain limitations in the vision foundation models. We further explore OpenVLA, which uses two pre-trained vision foundation models and is therefore expected to generalize to out-of-domain experiments. However, we showcase catastrophic forgetting of DINO-v2 in OpenVLA through its failure on a depth-regression task. To overcome this visual catastrophic forgetting, we propose a gradual backbone reversal approach founded on model merging. This enables OpenVLA -- which requires the adaptation of the visual backbones during initial training -- to regain its visual generalization ability. Regaining this capability enables our ReVLA model to improve over OpenVLA by 77% and 66% for grasping and lifting in visual OOD tasks.
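To make the "gradual backbone reversal" idea concrete, the sketch below shows one plausible reading: the VLA-adapted visual backbone is gradually merged back toward its original pre-trained weights by linear interpolation, while the rest of the policy keeps training. It is a minimal illustration only; the function and attribute names (merge_backbones, vla.vision_backbone, finetune_one_epoch) and the linear schedule are assumptions, not the authors' exact procedure.

```python
import copy
import torch

def merge_backbones(adapted, pretrained, alpha):
    """Linearly interpolate backbone weights.

    alpha = 0.0 keeps the VLA-adapted backbone unchanged,
    alpha = 1.0 fully reverts to the original pre-trained backbone.
    Both modules are assumed to share the same architecture.
    """
    merged = copy.deepcopy(adapted)
    merged_sd = merged.state_dict()
    pre_sd = pretrained.state_dict()
    with torch.no_grad():
        for name, w in merged_sd.items():
            if not torch.is_floating_point(w):
                continue  # skip integer buffers, if any
            merged_sd[name] = (1.0 - alpha) * w + alpha * pre_sd[name].to(w.dtype)
    merged.load_state_dict(merged_sd)
    return merged

# Hypothetical gradual-reversal loop: revert the vision backbone in steps while
# the action head keeps training, so the policy can adapt to the reverted features.
# `vla`, `original_dino`, and `finetune_one_epoch` are placeholders.
# adapted_backbone = copy.deepcopy(vla.vision_backbone)  # snapshot after initial VLA training
# for alpha in (0.25, 0.5, 0.75, 1.0):
#     vla.vision_backbone = merge_backbones(adapted_backbone, original_dino, alpha)
#     finetune_one_epoch(vla)  # action head adapts to the increasingly reverted features
```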
Related papers
- Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in-context, zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z)
- Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust [9.647148940880381]
Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies.
We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that dynamically identifies regions of the input image that the model is sensitive to (a minimal sensitivity-probing sketch appears after this list).
We show that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds.
arXiv Detail & Related papers (2024-10-02T19:29:24Z)
- Towards Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation: An Empirical Study [7.8735930411335895]
Vision-language-action (VLA) models have attracted much attention regarding their potential to advance robotic manipulation.
Despite the end-to-end perception-control loop offered by the VLA models, there is a lack of comprehensive understanding of the capabilities of such models.
We present VLATest, a testing framework that automatically generates diverse robotic manipulation scenes to assess the performance of VLA models.
arXiv Detail & Related papers (2024-09-19T16:33:00Z)
- TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [32.406783380729024]
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes.
Current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data.
We introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models.
arXiv Detail & Related papers (2024-09-19T07:10:18Z)
- OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks while requiring a significantly smaller parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
- VinVL: Revisiting Visual Representations in Vision-Language Models [96.39332942534368]
We develop an improved object detection model to provide object-centric representations of images.
The new visual features significantly improve performance across all vision-language (VL) tasks.
We will release the new object detection model to the public.
arXiv Detail & Related papers (2021-01-02T23:35:27Z)
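As a companion to the BYOVLA entry above, the following sketch illustrates the general run-time sensitivity-probing idea: perturb candidate image regions and measure how strongly the policy's predicted action changes. It is a minimal illustration under assumed interfaces (vla_policy, the region list, and the blur perturbation are hypothetical), not the procedure from that paper.

```python
import torch
import torch.nn.functional as F

def region_sensitivity(vla_policy, image, instruction, regions, blur_kernel=15):
    """Score how sensitive the policy's predicted action is to each image region.

    `vla_policy(image, instruction)` is assumed to return an action tensor;
    `image` is a (C, H, W) float tensor; `regions` is a list of (y0, y1, x0, x1) boxes.
    """
    with torch.no_grad():
        base_action = vla_policy(image, instruction)
        scores = []
        for (y0, y1, x0, x1) in regions:
            perturbed = image.clone()
            patch = perturbed[:, y0:y1, x0:x1]
            # Box blur as a simple stand-in perturbation for the region.
            blurred = F.avg_pool2d(patch, kernel_size=blur_kernel,
                                   stride=1, padding=blur_kernel // 2)
            perturbed[:, y0:y1, x0:x1] = blurred
            action = vla_policy(perturbed, instruction)
            scores.append(torch.linalg.vector_norm(action - base_action).item())
    return scores  # higher score = region the policy is more sensitive to

# Regions that score high yet are judged task-irrelevant (e.g. by a separate VLM)
# could then be masked or inpainted before the policy acts.
```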