Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing
- URL: http://arxiv.org/abs/2501.14905v1
- Date: Fri, 24 Jan 2025 20:13:29 GMT
- Title: Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing
- Authors: Madeline Anderson, Miriam Cha, William T. Freeman, J. Taylor Perron, Nathaniel Maidel, Kerri Cahoy
- Abstract summary: We propose a new method to enhance vision-language datasets for remote sensing by integrating maps as external data sources.
We introduce fMoW-mm, a multimodal dataset incorporating satellite imagery, maps, metadata, and text annotations.
- Score: 19.344890308208555
- License:
- Abstract: Vision language models have achieved impressive results across various fields. However, adoption in remote sensing remains limited, largely due to the scarcity of paired image-text data. To bridge this gap, synthetic caption generation has gained interest, traditionally relying on rule-based methods that use metadata or bounding boxes. While these approaches provide some description, they often lack the depth needed to capture complex wide-area scenes. Large language models (LLMs) offer a promising alternative for generating more descriptive captions, yet they can produce generic outputs and are prone to hallucination. In this paper, we propose a new method to enhance vision-language datasets for remote sensing by integrating maps as external data sources, enabling the generation of detailed, context-rich captions. Additionally, we present methods to measure and mitigate hallucinations in LLM-generated text. We introduce fMoW-mm, a multimodal dataset incorporating satellite imagery, maps, metadata, and text annotations. We demonstrate its effectiveness for automatic target recognition in few-shot settings, achieving superior performance compared to other vision-language remote sensing datasets.
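As a rough illustration of how map data could serve to measure and mitigate hallucinations in generated captions, the sketch below checks caption claims against map-derived tags and filters captions with unsupported claims. It is a hypothetical sketch under assumed data structures (`Sample`, `CHECKABLE_TERMS`), not the authors' released pipeline or their actual metric.

```python
# Hypothetical sketch (not the paper's code): ground LLM-generated captions
# against map-derived context to flag likely hallucinations.
from dataclasses import dataclass

@dataclass
class Sample:
    image_id: str
    map_tags: set[str]   # e.g., OSM-style tags extracted around the image footprint
    metadata: dict       # e.g., {"location": "...", "timestamp": "..."}
    caption: str         # LLM-generated caption to be audited

# Vocabulary of checkable ground-feature terms; illustrative only.
CHECKABLE_TERMS = {"highway", "river", "airport", "stadium", "parking", "railway", "harbor"}

def hallucination_rate(sample: Sample) -> float:
    """Fraction of checkable terms mentioned in the caption that have no support
    in the map tags for this scene; a crude proxy for object hallucination."""
    caption_terms = {t for t in CHECKABLE_TERMS if t in sample.caption.lower()}
    if not caption_terms:
        return 0.0
    unsupported = {t for t in caption_terms if t not in sample.map_tags}
    return len(unsupported) / len(caption_terms)

def filter_dataset(samples: list[Sample], max_rate: float = 0.0) -> list[Sample]:
    """Keep only captions whose checkable claims are fully supported by the map."""
    return [s for s in samples if hallucination_rate(s) <= max_rate]
```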
Related papers
- AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding [63.09928907734156]
AlignVLM is a vision-text alignment method that maps visual features to a weighted average of text embeddings.
Our experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods.
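The alignment described above lends itself to a compact sketch: each projected visual feature is softmax-weighted against the LLM's token-embedding matrix and replaced by the resulting convex combination. The module below is an illustrative approximation; the projection layer and tensor shapes are assumptions, not AlignVLM's exact architecture.

```python
# Sketch: map visual features to a weighted average of text (token) embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualToTextAlign(nn.Module):
    def __init__(self, vis_dim: int, text_embed: torch.Tensor):
        super().__init__()
        # text_embed: (vocab_size, txt_dim) frozen token-embedding matrix of the LLM
        self.register_buffer("text_embed", text_embed)
        self.proj = nn.Linear(vis_dim, text_embed.shape[1])

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, num_patches, vis_dim)
        q = self.proj(vis_feats)                # (B, P, txt_dim)
        logits = q @ self.text_embed.T          # similarity to every vocabulary token
        weights = F.softmax(logits, dim=-1)     # (B, P, vocab_size)
        return weights @ self.text_embed        # weighted average of text embeddings
```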
arXiv Detail & Related papers (2025-02-03T13:34:51Z)
- Multilingual Vision-Language Pre-training for the Remote Sensing Domain [4.118895088882213]
Methods based on Contrastive Language-Image Pre-training (CLIP) are now widely used to support vision-and-language tasks involving remote sensing data.
This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model.
Our resulting model, which we named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks.
arXiv Detail & Related papers (2024-10-30T18:13:11Z)
- AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions.
We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images.
Our model is capable of processing images with resolutions up to $1008\times 1008$.
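A minimal sketch of such dynamic partitioning is given below: the tile grid, and hence the visual-token count, is derived from the image's size and aspect ratio, capped at the $1008\times 1008$ resolution mentioned in the summary. The tile size and tokens-per-tile values are assumptions, not AdaptVision's published configuration.

```python
# Illustrative dynamic image partitioning: token count scales with size and aspect ratio.
import math

BASE_TILE = 336   # assumed per-tile input size for the vision encoder
MAX_SIDE = 1008   # maximum supported resolution per the summary above

def partition_grid(width: int, height: int) -> tuple[int, int]:
    """Return (cols, rows) of tiles for an image, respecting the resolution cap."""
    width, height = min(width, MAX_SIDE), min(height, MAX_SIDE)
    cols = max(1, math.ceil(width / BASE_TILE))
    rows = max(1, math.ceil(height / BASE_TILE))
    return cols, rows

def num_visual_tokens(width: int, height: int, tokens_per_tile: int = 576) -> int:
    cols, rows = partition_grid(width, height)
    return cols * rows * tokens_per_tile   # grows with image size and aspect ratio
```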
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
The framework extracts knowledge-grounded attributes from symbolic knowledge bases to generate image-text pairs that are semantically consistent yet distinctive.
This highlights the value of external knowledge proxies for improved interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning? [29.237078890377514]
Large Vision-Language Models (LVLMs) excel in integrating visual and linguistic contexts to produce detailed content.
LVLM-generated descriptions often suffer from object hallucination (OH), where the output text misrepresents the actual objects in the input image.
This paper proposes a novel decoding strategy, Differentiated Beam Decoding (DBD), along with a reliable new set of evaluation metrics.
arXiv Detail & Related papers (2024-06-18T14:33:56Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
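The retrieval component can be sketched as a simple kNN memory over visual features, as below: captions of the most visually similar memory images are retrieved and supplied to the decoder as additional context. The feature format and decoder interface are assumptions for illustration, not the paper's implementation.

```python
# Sketch of a kNN caption memory keyed by visual similarity.
import numpy as np

class KNNCaptionMemory:
    def __init__(self, image_feats: np.ndarray, captions: list[str]):
        # image_feats: (N, D) visual features of the memory images
        self.feats = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
        self.captions = captions

    def retrieve(self, query_feat: np.ndarray, k: int = 5) -> list[str]:
        q = query_feat / np.linalg.norm(query_feat)
        sims = self.feats @ q                  # cosine similarity to every memory item
        top = np.argsort(-sims)[:k]
        return [self.captions[i] for i in top]

# Usage: retrieved captions are passed to the caption decoder as extra context, e.g.
# context = " ".join(memory.retrieve(query_feat, k=5))
```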
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)
- MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images [7.0622873873577054]
We propose a novel metadata-collaborative segmentation network (MetaSegNet) for semantic segmentation of remote sensing images.
Unlike common model structures that use only unimodal visual data, we extract key characteristics from freely available remote sensing image metadata.
We construct an image encoder, a text encoder, and a crossmodal attention fusion subnetwork to extract and fuse image and text features.
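One plausible form of such a crossmodal fusion subnetwork is sketched below: flattened image-encoder features attend to metadata-text features via multi-head cross-attention with a residual connection. Dimensions and layer choices are assumptions, not MetaSegNet's exact design.

```python
# Sketch of a crossmodal attention fusion block: image features attend to text features.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, H*W, dim) flattened image-encoder features
        # txt_feats: (B, L, dim)  metadata-text encoder features
        fused, _ = self.attn(query=img_feats, key=txt_feats, value=txt_feats)
        return self.norm(img_feats + fused)   # residual fusion, fed to the segmentation head
```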
arXiv Detail & Related papers (2023-12-20T03:16:34Z)
- Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models [16.4010094165575]
We propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations.
Inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation.
Our method outperforms the state of the art on various metrics.
arXiv Detail & Related papers (2023-11-05T01:14:02Z)
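Pixel-level masking as described in the preceding entry can be sketched as follows: random non-overlapping patches of the image are zeroed out while the caption and phrase-box annotations are left unchanged. The patch size and mask ratio are assumed values for illustration, not the paper's settings.

```python
# Illustrative pixel-level masking augmentation for image-caption pairs.
import numpy as np

def pixel_mask(image: np.ndarray, patch: int = 32, ratio: float = 0.3,
               rng: np.random.Generator | None = None) -> np.ndarray:
    """Zero out a random subset of non-overlapping patches in an HxWxC image."""
    rng = rng or np.random.default_rng()
    out = image.copy()
    h, w = image.shape[:2]
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            if rng.random() < ratio:
                out[y:y + patch, x:x + patch] = 0
    return out
```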
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.