Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data
- URL: http://arxiv.org/abs/2401.17600v1
- Date: Wed, 31 Jan 2024 04:57:12 GMT
- Title: Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data
- Authors: Chenhui Zhang, Sherrie Wang
- Abstract summary: We propose a benchmark to gauge the progress of Large Vision-Language Models (VLMs) toward being useful tools for Earth observation data.
Motivated by real-world applications, our benchmark includes scenarios like urban monitoring, disaster relief, land use, and conservation.
Our benchmark will be made publicly available at https://vleo.danielz.ch/ and on Hugging Face at https://huggingface.co/collections/mit-ei/vleo-benchmark-datasets-65b789b0466555489cce0d70.
- Score: 7.797577465015058
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Vision-Language Models (VLMs) have demonstrated impressive performance
on complex tasks involving visual input with natural language instructions.
However, it remains unclear to what extent capabilities on natural images
transfer to Earth observation (EO) data, which are predominantly satellite and
aerial images less common in VLM training data. In this work, we propose a
comprehensive benchmark to gauge the progress of VLMs toward being useful tools
for EO data by assessing their abilities on scene understanding, localization
and counting, and change detection tasks. Motivated by real-world applications,
our benchmark includes scenarios like urban monitoring, disaster relief, land
use, and conservation. We discover that, although state-of-the-art VLMs like
GPT-4V possess extensive world knowledge that leads to strong performance on
open-ended tasks like location understanding and image captioning, their poor
spatial reasoning limits usefulness on object localization and counting tasks.
Our benchmark will be made publicly available at https://vleo.danielz.ch/ and
on Hugging Face at
https://huggingface.co/collections/mit-ei/vleo-benchmark-datasets-65b789b0466555489cce0d70
for easy model evaluation.
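To illustrate the "easy model evaluation" workflow mentioned above, here is a minimal, hypothetical sketch of loading one benchmark example from the Hugging Face collection and querying GPT-4V with a counting prompt. The dataset repository name ("mit-ei/vleo-counting"), the "image" and "count" field names, and the prompt wording are placeholders assumed for illustration, not taken from the paper; substitute a real dataset from the collection linked above.

```python
# Minimal sketch (not the authors' evaluation code). Assumes the
# `datasets` and `openai` packages and an OPENAI_API_KEY in the
# environment. The dataset repo name and field names below are
# placeholders; pick a real dataset from the VLEO collection.
import base64
import io

from datasets import load_dataset
from openai import OpenAI


def to_base64_png(pil_image) -> str:
    """Serialize a PIL image to base64-encoded PNG bytes."""
    buf = io.BytesIO()
    pil_image.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")


def main() -> None:
    # Hypothetical dataset name -- replace with an actual repo from
    # https://huggingface.co/collections/mit-ei/vleo-benchmark-datasets-65b789b0466555489cce0d70
    ds = load_dataset("mit-ei/vleo-counting", split="test")
    example = ds[0]  # assumed to carry an "image" and a ground-truth "count"

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # GPT-4V snapshot; swap in a current model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Count the buildings visible in this satellite "
                         "image. Answer with a single integer."},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,"
                                      + to_base64_png(example["image"])}},
            ],
        }],
        max_tokens=20,
    )
    print("Model answer:", response.choices[0].message.content)
    print("Ground truth:", example.get("count"))


if __name__ == "__main__":
    main()
```

A full evaluation would loop over the split, parse the integer answers, and compare them against the ground-truth counts (e.g., via accuracy or mean absolute error), mirroring the counting tasks described above.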
Related papers
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects [84.73092715537364]
In this paper, we study a new task of navigating to diverse target objects in a large number of scene types.
We build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning.
Our agent achieves a success rate that surpasses GPT-4o by over 20%.
arXiv Detail & Related papers (2024-10-03T17:49:28Z)
- Learning Visual Grounding from Generative Vision and Language Model [29.2712567454021]
Visual grounding tasks aim to localize image regions based on natural language references.
We find that grounding knowledge already exists in generative VLMs and can be elicited by proper prompting.
Our results demonstrate the promise of generative VLMs for scaling up visual grounding in the real world.
arXiv Detail & Related papers (2024-07-18T20:29:49Z)
- SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models [68.13636352687257]
We introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities.
During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances.
Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts.
arXiv Detail & Related papers (2024-06-03T17:59:06Z)
- Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs [38.02017186215372]
Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks.
However, existing V-LLMs demonstrate weak spatial reasoning and localization awareness.
We explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs.
arXiv Detail & Related papers (2024-04-11T03:09:34Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
- RSGPT: A Remote Sensing Vision Language Model and Benchmark [7.279747655485913]
We build a high-quality Remote Sensing Image Captioning dataset (RSICap).
This dataset comprises 2,585 human-annotated captions with rich and high-quality information.
We also provide a benchmark evaluation dataset called RSIEval.
arXiv Detail & Related papers (2023-07-28T02:23:35Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z) - VIPHY: Probing "Visible" Physical Commonsense Knowledge [22.00069189468524]
Vision-language models (VLMs) have shown remarkable performance on visual reasoning tasks.
We evaluate their ability to acquire "visible" physical knowledge.
Our results indicate a severe gap between model and human performance.
arXiv Detail & Related papers (2022-09-15T02:06:25Z)
This list is automatically generated from the titles and abstracts of papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.