RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models
- URL: http://arxiv.org/abs/2408.14744v1
- Date: Tue, 27 Aug 2024 02:45:26 GMT
- Title: RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models
- Authors: Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang,
- Abstract summary: We propose a workflow to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform.
Within this framework, we present RSTeller, a multimodal dataset comprising over 1 million RS images, each accompanied by multiple descriptive captions.
- Score: 3.178739428363249
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Abundant, well-annotated multimodal data in remote sensing are pivotal for aligning complex visual remote sensing (RS) scenes with human language, enabling the development of specialized vision language models across diverse RS interpretation tasks. However, annotating RS images with rich linguistic semantics at scale demands expertise in RS and substantial human labor, making it costly and often impractical. In this study, we propose a workflow that leverages large language models (LLMs) to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform. This approach facilitates the generation of paired remote sensing data and can be readily scaled up using openly available data. Within this framework, we present RSTeller, a multimodal dataset comprising over 1 million RS images, each accompanied by multiple descriptive captions. Extensive experiments demonstrate that RSTeller enhances the performance of multiple existing vision language models for RS scene understanding through continual pre-training. Our methodology significantly reduces the manual effort and expertise needed for annotating remote sensing imagery while democratizing access to high-quality annotated data. This advancement fosters progress in visual language modeling and encourages broader participation in remote sensing research and applications. The RSTeller dataset is available at https://github.com/SlytherinGe/RSTeller.
Related papers
- LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation [21.91073335335992]
We introduce LHRS-Bot-Nova, an MLLM specialized in understanding remote sensing (RS) images.
LHRS-Bot-Nova features an enhanced vision encoder and a novel bridge layer, enabling efficient visual compression and better language-vision alignment.
Extensive experiments demonstrate superior performance of LHRS-Bot-Nova across various RS image understanding tasks.
arXiv Detail & Related papers (2024-11-14T09:23:40Z) - Multilingual Vision-Language Pre-training for the Remote Sensing Domain [4.118895088882213]
Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data.
This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model.
Our resulting model, which we named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks.
arXiv Detail & Related papers (2024-10-30T18:13:11Z) - MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation [25.252173311925027]
We propose a Multi-modal, Multi-GSD, Multi-scene Remote Sensing (MMM-RS) dataset and benchmark for text-to-image generation in diverse remote sensing scenarios.
We utilize a large-scale pretrained vision-language model to automatically output text prompts and perform hand-crafted rectification, resulting in information-rich text-image pairs.
With extensive manual screening and refining annotations, we ultimately obtain a MMM-RS dataset that comprises approximately 2.1 million text-image pairs.
arXiv Detail & Related papers (2024-10-26T11:19:07Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models make the most effort to the global image information.
In this study, we find that the intermediate layers of models can encode more global semantic information.
We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z) - Large Language Models for Captioning and Retrieving Remote Sensing
Images [4.499596985198142]
RS-CapRet is a Vision and Language method for remote sensing tasks.
It can generate descriptions for remote sensing images and retrieve images from textual descriptions.
arXiv Detail & Related papers (2024-02-09T15:31:01Z) - MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z) - GeoChat: Grounded Large Vision-Language Model for Remote Sensing [65.78360056991247]
We propose GeoChat - the first versatile remote sensing Large Vision-Language Models (VLMs) that offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat can answer image-level queries but also accepts region inputs to hold region-specific dialogue.
GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection.
arXiv Detail & Related papers (2023-11-24T18:59:10Z) - Reformulating Vision-Language Foundation Models and Datasets Towards
Universal Multimodal Assistants [65.47222691674074]
Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z) - Position-Enhanced Visual Instruction Tuning for Multimodal Large
Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs)
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z) - RSGPT: A Remote Sensing Vision Language Model and Benchmark [7.279747655485913]
We build a high-quality Remote Sensing Image Captioning dataset (RSICap)
This dataset comprises 2,585 human-annotated captions with rich and high-quality information.
We also provide a benchmark evaluation dataset called RSIEval.
arXiv Detail & Related papers (2023-07-28T02:23:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.