ChatEarthNet: A Global-Scale Image-Text Dataset Empowering
Vision-Language Geo-Foundation Models
- URL: http://arxiv.org/abs/2402.11325v2
- Date: Mon, 26 Feb 2024 20:29:22 GMT
- Title: ChatEarthNet: A Global-Scale Image-Text Dataset Empowering
Vision-Language Geo-Foundation Models
- Authors: Zhenghang Yuan, Zhitong Xiong, Lichao Mou, and Xiao Xiang Zhu
- Abstract summary: ChatEarthNet is a large-scale image-text dataset characterized by global coverage, high quality, wide-ranging diversity, and detailed descriptions.
ChatEarthNet consists of 163,488 image-text pairs with captions generated by ChatGPT-3.5 and an additional 10,000 image-text pairs with captions generated by ChatGPT-4V(ision).
- Score: 26.583783910846723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An in-depth comprehension of global land cover is essential in Earth
observation, forming the foundation for a multitude of applications. Although
remote sensing technology has advanced rapidly, leading to a proliferation of
satellite imagery, the inherent complexity of these images often makes them
difficult for non-expert users to understand. Natural language, as a carrier of
human knowledge, can be a bridge between common users and complicated satellite
imagery. In this context, we introduce a global-scale, high-quality image-text
dataset for remote sensing, providing natural language descriptions for
Sentinel-2 data to facilitate the understanding of satellite imagery for common
users. Specifically, we utilize Sentinel-2 data for its global coverage as the
foundational image source, employing semantic segmentation labels from the
European Space Agency's (ESA) WorldCover project to enrich the descriptions of
land covers. By conducting in-depth semantic analysis, we formulate detailed
prompts to elicit rich descriptions from ChatGPT. To enhance the dataset's
quality, we introduce a manual verification process in which captions are
manually inspected and corrected, significantly improving their accuracy.
Finally, we offer the community
ChatEarthNet, a large-scale image-text dataset characterized by global
coverage, high quality, wide-ranging diversity, and detailed descriptions.
ChatEarthNet consists of 163,488 image-text pairs with captions generated by
ChatGPT-3.5 and an additional 10,000 image-text pairs with captions generated
by ChatGPT-4V(ision). This dataset has significant potential for training
vision-language geo-foundation models and evaluating large vision-language
models for remote sensing. The dataset will be made publicly available.
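As a rough illustration of the caption pipeline described in the abstract, the sketch below computes land-cover proportions from a WorldCover segmentation mask aligned with a Sentinel-2 patch and assembles them into a caption-elicitation prompt. The class-code mapping follows the public WorldCover product documentation, while the function names and prompt wording are illustrative assumptions, not the paper's actual templates.

```python
import numpy as np

# ESA WorldCover (v100) class codes -> names; standard mapping, but verify
# against the WorldCover documentation for your data version.
WORLDCOVER_CLASSES = {
    10: "tree cover", 20: "shrubland", 30: "grassland", 40: "cropland",
    50: "built-up", 60: "bare / sparse vegetation", 70: "snow and ice",
    80: "permanent water bodies", 90: "herbaceous wetland",
    95: "mangroves", 100: "moss and lichen",
}

def land_cover_stats(mask: np.ndarray) -> dict:
    """Fraction of the patch occupied by each WorldCover class."""
    values, counts = np.unique(mask, return_counts=True)
    total = counts.sum()
    return {
        WORLDCOVER_CLASSES.get(int(v), f"class {int(v)}"): c / total
        for v, c in zip(values, counts)
    }

def build_prompt(mask: np.ndarray) -> str:
    """Turn class statistics into a caption-elicitation prompt.

    The wording below is illustrative only; ChatEarthNet's actual prompt
    templates are defined in the paper and not reproduced here.
    """
    stats = land_cover_stats(mask)
    summary = ", ".join(
        f"{name}: {frac:.1%}" for name, frac in
        sorted(stats.items(), key=lambda kv: kv[1], reverse=True)
    )
    return (
        "You are describing a Sentinel-2 satellite image. "
        f"Its land-cover composition is: {summary}. "
        "Write a detailed natural-language description of the scene."
    )

# Example: a toy 4x4 'mask' mixing cropland (40) and built-up (50) pixels.
toy_mask = np.array([[40, 40, 50, 50]] * 4)
print(build_prompt(toy_mask))
```

In ChatEarthNet itself, prompts of this kind are sent to ChatGPT-3.5 or ChatGPT-4V(ision), and the returned captions are then manually verified.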
Related papers
- SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation [12.32553804641971]
Vision language models (VLMs) have made remarkable progress in natural language processing and image understanding.
This paper proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M.
arXiv Detail & Related papers (2025-02-12T07:19:36Z)
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the problem with richer semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- Bridging Local Details and Global Context in Text-Attributed Graphs [62.522550655068336]
GraphBridge is a framework that bridges local and global perspectives by leveraging contextual textual information.
Our method achieves state-of-the-art performance, while our graph-aware token reduction module significantly enhances efficiency and solves scalability issues.
arXiv Detail & Related papers (2024-06-18T13:35:25Z)
- Towards Vision-Language Geo-Foundation Model: A Survey [65.70547895998541]
Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks.
This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2024-06-13T17:57:30Z)
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrieval [8.656768875730904]
We introduce LuojiaHOG, an image caption dataset that is geospatial-aware, label-extension-friendly, and comprehensively captioned.
LuojiaHOG involves hierarchical spatial sampling, a classification system aligned with Open Geospatial Consortium (OGC) standards, and detailed caption generation.
We also propose a CLIP-based Image Semantic Enhancement Network (CISEN) to promote sophisticated ITR.
arXiv Detail & Related papers (2024-03-16T10:46:14Z)
- SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing [14.79627534702196]
We construct a vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags.
With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification.
It also demonstrates zero-shot transfer ability for fine-grained object attribute classification and cross-modal retrieval.
arXiv Detail & Related papers (2023-12-20T09:19:48Z)
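For context on the zero-shot scene classification reported in the SkyScript entry above, here is a generic CLIP-style sketch. It assumes the open_clip package with a public checkpoint; the class prompts and image path are placeholders, and this is not the SkyScript authors' model or code.

```python
import torch
import open_clip
from PIL import Image

# Load a public CLIP checkpoint (placeholder; a SkyScript-pretrained
# checkpoint could be substituted if available).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Hypothetical scene classes and image file.
scene_classes = ["airport", "forest", "harbor", "residential area", "farmland"]
prompts = tokenizer([f"a satellite image of a {c}" for c in scene_classes])
image = preprocess(Image.open("scene.png").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(prompts)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    # Cosine similarity between the image and each class prompt,
    # turned into a probability distribution over classes.
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(dict(zip(scene_classes, probs.squeeze(0).tolist())))
```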
- MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images [7.0622873873577054]
We propose a novel metadata-collaborative segmentation network (MetaSegNet) for semantic segmentation of remote sensing images.
Unlike common model structures that use only unimodal visual data, we extract key characteristics from freely available remote sensing image metadata.
We construct an image encoder, a text encoder, and a cross-modal attention fusion subnetwork to extract image and text features.
arXiv Detail & Related papers (2023-12-20T03:16:34Z)
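The image-encoder / text-encoder / cross-modal attention fusion design mentioned in the MetaSegNet entry above can be sketched roughly as follows. This is a toy PyTorch module with assumed dimensions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Toy cross-modal attention fusion: image tokens attend to text tokens."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # Image tokens query the text (metadata) tokens; the result is added
        # back residually, so the visual features become text-aware.
        fused, _ = self.attn(query=img_tokens, key=txt_tokens, value=txt_tokens)
        return self.norm(img_tokens + fused)

# Toy usage: 1 sample, 1024 image patch tokens, 16 metadata text tokens.
img_tokens = torch.randn(1, 1024, 256)   # from a hypothetical image encoder
txt_tokens = torch.randn(1, 16, 256)     # from a hypothetical text encoder
out = CrossModalFusion()(img_tokens, txt_tokens)
print(out.shape)  # torch.Size([1, 1024, 256])
```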
- Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching [60.645802236700035]
Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets.
We introduce GeoText-1652, a new natural language-guided geo-localization benchmark.
This dataset is systematically constructed through an interactive human-computer process.
arXiv Detail & Related papers (2023-11-21T17:52:30Z)
- Towards Automatic Satellite Images Captions Generation Using Large Language Models [0.5439020425819]
We propose Automatic Remote Sensing Image Captioning (ARSIC) to automatically collect captions for remote sensing images.
We also present a benchmark model that adapts the pre-trained generative image2text model (GIT) to generate high-quality captions for remote-sensing images.
arXiv Detail & Related papers (2023-10-17T16:45:47Z)
- JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
arXiv Detail & Related papers (2023-07-03T02:39:08Z)