ViSTa Dataset: Do vision-language models understand sequential tasks?
- URL: http://arxiv.org/abs/2411.13211v2
- Date: Thu, 21 Nov 2024 16:37:32 GMT
- Title: ViSTa Dataset: Do vision-language models understand sequential tasks?
- Authors: Evžen Wybitul, Evan Ryan Gunter, Mikhail Seleznyov, David Lindner
- Abstract summary: Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety.
We introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks.
ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments.
- Score: 6.039062076849557
- License:
- Abstract: Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs' potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. Its novel hierarchical structure -- basic single-step tasks composed into more and more complex sequential tasks -- allows a fine-grained understanding of how well VLMs can judge tasks with varying complexity. To illustrate this, we use ViSTa to evaluate state-of-the-art VLMs, including CLIP, ViCLIP, and GPT-4o. We find that, while they are all good at object recognition, they fail to understand sequential tasks, with only GPT-4o achieving non-trivial performance.
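To make the evaluation setting concrete, the sketch below shows one naive way to use a contrastive VLM such as CLIP to score a video against candidate step-by-step task descriptions: sample frames, embed them alongside the texts, and average per-frame similarities. This is only an illustrative baseline under assumed choices (the openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers, uniform mean pooling over frames); it is not the paper's exact evaluation protocol.

```python
# Minimal sketch: scoring a video against candidate task descriptions with CLIP.
# Assumptions (not from the paper): Hugging Face `transformers` CLIP, the
# openai/clip-vit-base-patch32 checkpoint, and simple mean pooling over frames.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_video(frames: list[Image.Image], descriptions: list[str]) -> torch.Tensor:
    """Return one score per description: mean frame-text similarity."""
    inputs = processor(text=descriptions, images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_frames, num_descriptions).
    return out.logits_per_image.mean(dim=0)

# Hypothetical usage: pick whichever description best matches the clip.
# frames = [Image.open(p) for p in sorted(frame_paths)]   # sampled video frames
# candidates = ["open the fridge", "open the fridge, then close it"]
# best = candidates[int(score_video(frames, candidates).argmax())]
```

A scorer of this kind captures object recognition well but discards temporal order, which is the kind of failure mode the hierarchical sequential tasks in ViSTa are designed to expose.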
Related papers
- From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning [26.21049702284394]
Large vision language models (VLMs) combine large language models with vision encoders, demonstrating promise across various tasks.
We introduce VITask, a novel framework that enhances the task-specific adaptability of VLMs by integrating task-specific models (TSMs).
arXiv Detail & Related papers (2024-10-09T01:24:04Z)
- DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects [84.73092715537364]
In this paper, we study a new task of navigating to diverse target objects in a large number of scene types.
We build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning.
Our agent achieves a success rate that surpasses GPT-4o by over 20%.
arXiv Detail & Related papers (2024-10-03T17:49:28Z)
- @Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology [31.779074930032184]
Human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously.
We first create a novel AT benchmark (@Bench) guided by a pre-design user study with PVIs.
In addition, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs.
arXiv Detail & Related papers (2024-09-21T18:30:17Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also to lack some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that evaluates the spatial planning capability of these models in general.
arXiv Detail & Related papers (2024-07-02T00:24:01Z)
- TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z)
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning [53.93074108238167]
We construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date.
We propose a two-stage instruction tuning framework, in which VLMs are finetuned on Vision-Flan and further tuned on GPT-4 synthesized data.
We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework.
arXiv Detail & Related papers (2024-02-18T19:38:44Z)
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs [140.14239499047977]
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding.
We propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT).
We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
arXiv Detail & Related papers (2024-02-12T18:33:47Z)
- A Continual Learning Paradigm for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks [51.053901491986025]
We propose a Continual Learning paradigm (CLVP) for VisProg across various visual reasoning tasks.
Our CLVP distills the capabilities of well-trained task-specific models into the visual sub-modules in a stepwise and anti-forgetting manner.
arXiv Detail & Related papers (2023-09-18T14:28:47Z)
- One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks [59.49639580525051]
Multimodal models are aimed at solving Vision and Language (V+L) tasks.
Current work assumes that a single pre-trained vision encoder (VE) can serve as a general-purpose encoder.
In this work, we focus on analysis and aim to understand whether the information stored within different VEs is complementary.
arXiv Detail & Related papers (2022-10-12T16:31:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.