MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation
- URL: http://arxiv.org/abs/2410.22362v1
- Date: Sat, 26 Oct 2024 11:19:07 GMT
- Title: MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation
- Authors: Jialin Luo, Yuanzhi Wang, Ziqi Gu, Yide Qiu, Shuaizhen Yao, Fuyun Wang, Chunyan Xu, Wenhua Zhang, Dan Wang, Zhen Cui
- Abstract summary: We propose a Multi-modal, Multi-GSD, Multi-scene Remote Sensing (MMM-RS) dataset and benchmark for text-to-image generation in diverse remote sensing scenarios.
We utilize a large-scale pretrained vision-language model to automatically output text prompts and perform hand-crafted rectification, resulting in information-rich text-image pairs.
With extensive manual screening and refining annotations, we ultimately obtain an MMM-RS dataset that comprises approximately 2.1 million text-image pairs.
- Score: 25.252173311925027
- Abstract: Recently, the diffusion-based generative paradigm has achieved impressive general image generation capabilities with text prompts due to its accurate distribution modeling and stable training process. However, generating diverse remote sensing (RS) images that are tremendously different from general images in terms of scale and perspective remains a formidable challenge due to the lack of a comprehensive remote sensing image generation dataset with various modalities, ground sample distances (GSD), and scenes. In this paper, we propose a Multi-modal, Multi-GSD, Multi-scene Remote Sensing (MMM-RS) dataset and benchmark for text-to-image generation in diverse remote sensing scenarios. Specifically, we first collect nine publicly available RS datasets and conduct standardization for all samples. To bridge RS images to textual semantic information, we utilize a large-scale pretrained vision-language model to automatically output text prompts and perform hand-crafted rectification, resulting in information-rich text-image pairs (including multi-modal images). In particular, we design some methods to obtain the images with different GSD and various environments (e.g., low-light, foggy) in a single sample. With extensive manual screening and refining annotations, we ultimately obtain an MMM-RS dataset that comprises approximately 2.1 million text-image pairs. Extensive experimental results verify that our proposed MMM-RS dataset allows off-the-shelf diffusion models to generate diverse RS images across various modalities, scenes, weather conditions, and GSD. The dataset is available at https://github.com/ljl5261/MMM-RS.
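The abstract mentions deriving images with different GSD and environments (low-light, foggy) from a single sample but does not say how. The sketch below is only an illustration under stated assumptions, not the authors' method: it coarsens GSD by block averaging, simulates fog with the standard atmospheric scattering model I = J*t + A*(1 - t), and darkens with a gain and gamma curve. All function names, parameter values, and the prompt wording are hypothetical; images are assumed to be float arrays in [0, 1] with shape (H, W, C).

```python
import numpy as np


def coarsen_gsd(image: np.ndarray, factor: int) -> np.ndarray:
    """Block-average so each output pixel covers factor x factor input pixels.

    A source GSD of g metres per pixel becomes roughly g * factor.
    """
    h, w, c = image.shape
    h2, w2 = (h // factor) * factor, (w // factor) * factor  # crop to a multiple
    blocks = image[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor, c)
    return blocks.mean(axis=(1, 3))


def add_fog(image: np.ndarray, transmission: float = 0.6, airlight: float = 1.0) -> np.ndarray:
    """Uniform haze via the scattering model I = J*t + A*(1 - t)."""
    return image * transmission + airlight * (1.0 - transmission)


def darken(image: np.ndarray, gain: float = 0.3, gamma: float = 2.0) -> np.ndarray:
    """Crude low-light simulation: gamma curve followed by a global gain."""
    return gain * np.power(image, gamma)


def make_prompt(gsd_m: float, condition: str) -> str:
    """Hypothetical prompt template; the dataset's real prompt format may differ."""
    return f"A {condition} remote sensing image with a GSD of {gsd_m:g} m"


# Derive three variants of one sample (assumed source GSD: 0.5 m).
sample = np.full((8, 8, 3), 0.5)
variants = {
    make_prompt(1.0, "clear"): coarsen_gsd(sample, 2),
    make_prompt(0.5, "foggy"): add_fog(sample),
    make_prompt(0.5, "low-light"): darken(sample),
}
```

Block averaging is used here because it needs no extra dependencies; a real pipeline would more likely use an anti-aliased resize from an image library.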
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis [17.83602731408318]
We introduce GAIA, a novel dataset for multi-scale, multi-sensor, and multi-modal Remote Sensing (RS) image analysis.
GAIA comprises 205,150 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions.
GAIA significantly improves performance on RS image classification, cross-modal retrieval and image captioning tasks.
arXiv Detail & Related papers (2025-02-13T18:52:14Z)
- MMO-IG: Multi-Class and Multi-Scale Object Image Generation for Remote Sensing [12.491684385808902]
MMO-IG is designed to generate RS images with supervised object labels from global and local aspects simultaneously.
Considering the complex interdependencies among MMOs, we construct a spatial-cross dependency knowledge graph.
Our MMO-IG exhibits superior generation capabilities for RS images with dense MMO-supervised labels.
arXiv Detail & Related papers (2024-12-18T10:19:12Z)
- RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts [17.76606110070648]
We propose RSUniVLM, a unified, end-to-end RS VLM for comprehensive vision understanding across multiple granularities.
RSUniVLM performs effectively in multi-image analysis, with instances of change detection and change captioning.
We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both the RS and general domains.
arXiv Detail & Related papers (2024-12-07T15:11:21Z)
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models [5.981597754991408]
We propose a workflow to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform.
Within this framework, we present RSTeller, a multimodal dataset comprising over 1.3 million RS images, each accompanied by two descriptive captions.
arXiv Detail & Related papers (2024-08-27T02:45:26Z)
- Semantic Alignment for Multimodal Large Language Models [72.10272479476161]
We introduce Semantic Alignment for Multi-modal large language models (SAM).
By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis.
arXiv Detail & Related papers (2024-08-23T06:48:46Z)
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning.
Our key idea involves challenging the model to discern both matching and distinct elements.
We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z)
- Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
arXiv Detail & Related papers (2024-04-03T23:20:40Z)
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG).
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z)
- Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval [21.05804942940532]
Cross-modal text-image retrieval has attracted extensive attention for its advantages of flexible input and efficient query.
To cope with the problem of multi-scale scarcity and target redundancy in the RS multimodal retrieval task, we come up with a novel asymmetric multimodal feature matching network (AMFMN).
Our model adapts to multi-scale feature inputs, favors multi-source retrieval methods, and can dynamically filter redundant features.
arXiv Detail & Related papers (2022-04-21T03:53:19Z)