DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark
- URL: http://arxiv.org/abs/2411.02733v1
- Date: Tue, 05 Nov 2024 02:03:12 GMT
- Title: DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark
- Authors: Haodong Li, Haicheng Qu, Xiaofeng Zhang
- Abstract summary: This paper introduces DDFAV, a high-quality remote sensing LVLM dataset created using data augmentation and data mixing strategies.
Next, a training instruction set is produced from high-quality remote sensing images selected from the proposed dataset.
Finally, we develop RSPOPE, a remote sensing LVLM hallucination evaluation method based on the proposed dataset, and evaluate the zero-shot capabilities of different LVLMs.
- Score: 11.633202976929873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid development of large vision language models (LVLMs), these models have shown excellent results in various multimodal tasks. Since LVLMs are prone to hallucinations and there are currently few datasets and evaluation methods specifically designed for remote sensing, their performance is typically poor when applied to remote sensing tasks. To address these issues, this paper introduces DDFAV, a high-quality remote sensing LVLM dataset created using data augmentation and data mixing strategies. Next, a training instruction set is produced from high-quality remote sensing images selected from the proposed dataset. Finally, we develop RSPOPE, a remote sensing LVLM hallucination evaluation method based on the proposed dataset, and evaluate the zero-shot capabilities of different LVLMs. Our proposed dataset, instruction set, and evaluation method files are available at https://github.com/HaodongLi2024/rspope.
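RSPOPE appears to adapt the POPE-style yes/no object-probing protocol to remote sensing imagery. The snippet below is a minimal sketch of that general protocol in Python, assuming a JSONL question file with `image`, `question`, and `answer` fields and a generic `model.generate(...)` wrapper; the exact RSPOPE file format and metrics are in the linked repository, not reproduced here.

```python
# Minimal sketch of a POPE-style yes/no hallucination evaluation loop.
# The model wrapper and question-file layout are assumptions for illustration.
import json

def ask_model(model, image_path, question):
    """Placeholder LVLM call; assumed to return free-form text."""
    return model.generate(image=image_path, prompt=question)

def evaluate_pope(model, question_file):
    tp = fp = tn = fn = 0
    with open(question_file) as f:
        for line in f:
            item = json.loads(line)  # assumed fields: image, question, answer ("yes"/"no")
            reply = ask_model(model, item["image"], item["question"]).lower()
            pred_yes = "yes" in reply[:8]          # crude yes/no parsing of the reply
            gold_yes = item["answer"] == "yes"
            if pred_yes and gold_yes:
                tp += 1
            elif pred_yes and not gold_yes:
                fp += 1                             # hallucinated object
            elif not pred_yes and gold_yes:
                fn += 1
            else:
                tn += 1
    total = max(tp + fp + tn + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / max(precision + recall, 1e-9),
        "yes_rate": (tp + fp) / total,
    }
```

A high yes-rate together with low precision is the usual signal that a model is hallucinating objects that are not present in the image.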
Related papers
- Re-Aligning Language to Visual Objects with an Agentic Workflow [73.73778652260911]
Language-based object detection aims to align visual objects with language expressions.
Recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects.
We propose an agentic workflow controlled by an LLM to re-align language to visual objects by adaptively adjusting image and text prompts.
arXiv Detail & Related papers (2025-03-30T16:41:12Z) - Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method [10.748210940033484]
Large language models (LLMs) and vision-language models (VLMs) have achieved significant success.
Due to the substantial differences between remote sensing images and conventional optical images, these models face challenges in comprehending remote sensing imagery.
This letter explores the application of VLMs for object detection in remote sensing images.
arXiv Detail & Related papers (2025-03-11T08:02:54Z) - Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models [9.238739743596236]
We propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment.
Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs with the top 30% of data ranked by our score model achieves superior interpretation accuracy (a minimal top-fraction selection sketch is given after this list).
arXiv Detail & Related papers (2025-03-02T05:44:56Z) - SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications.
Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth.
Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities; notably, maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z) - Expand VSR Benchmark for VLLM to Expertize in Spatial Rules [11.320245739677826]
Visual spatial reasoning is a basic part of human cognition that requires fine-grained perception across instances.
There is still a lack of evaluation and optimization datasets of sufficient quantity and quality for Vision Large Language Models (VLLMs) that specifically target visual positional reasoning.
We found that current VLLMs exhibit a contradictory combination of over-sensitivity to language instructions and under-sensitivity to visual positional information.
To our knowledge, we are the first to controllably expand spatially positioned image data using diffusion models, and we integrate it with the original visual encoding.
arXiv Detail & Related papers (2024-12-24T07:13:17Z) - Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation [98.92677830223786]
This work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective.
We propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data.
Our proposed method achieves performance comparable to or even superior to baselines trained with many more samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z) - Membership Inference Attacks against Large Vision-Language Models [40.996912464828696]
Large vision-language models (VLLMs) exhibit promising capabilities for processing multi-modal tasks across various application scenarios.
Their emergence also raises significant data security concerns, given the potential inclusion of sensitive information, such as private photos and medical records.
Detecting inappropriately used data in VLLMs remains a critical and unresolved issue.
arXiv Detail & Related papers (2024-11-05T08:35:08Z) - Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques [9.248637518957445]
We review the fundamental theories related to visual language models (VLMs) and the datasets constructed for them in remote sensing.
We categorize the improvement methods into three main parts according to the core components of VLMs and provide a detailed introduction and comparison of these methods.
arXiv Detail & Related papers (2024-10-15T13:28:55Z) - Video Instruction Tuning With Synthetic Data [84.64519990333406]
We create a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K.
This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA.
By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM.
arXiv Detail & Related papers (2024-10-03T17:36:49Z) - Bridge the Modality and Capability Gaps in Vision-Language Model Selection [62.26769826687365]
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names.
To better reuse the VLM resource, a promising strategy is selecting appropriate Pre-Trained VLMs from the VLM Zoo.
We analyze two inherent challenges in assessing a VLM's ability in this Language-Only VLM selection setting.
We propose VLM Selection With gAp Bridging to mitigate the negative impact of these two gaps.
arXiv Detail & Related papers (2024-03-20T17:54:58Z) - Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose TIVE, a high-value data selection approach, to eliminate redundancy within the visual instruction data and reduce the training cost.
Using only about 15% of the data, our approach achieves average performance comparable to the full-data fine-tuned model across eight benchmarks.
arXiv Detail & Related papers (2024-03-14T16:47:25Z) - ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models [45.040292339670096]
Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities.
This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data.
arXiv Detail & Related papers (2024-02-18T19:26:49Z) - Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning [79.32236399694077]
Low-quality data in the training set are usually detrimental to instruction tuning.
We propose a novel method, termed "reflection-tuning".
This approach utilizes an oracle LLM to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data.
arXiv Detail & Related papers (2023-10-18T05:13:47Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation hub (LVLM-eHub).
Our LVLM-eHub consists of 8 representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLMs with massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in open-world scenarios.
arXiv Detail & Related papers (2023-06-15T16:39:24Z) - LOVM: Language-Only Vision Model Selection [13.857583570058392]
We introduce a new task, LOVM: Language-Only Vision Model Selection, where methods are expected to perform both model selection and performance prediction.
We then introduce an extensive LOVM benchmark consisting of ground-truth evaluations of 35 pre-trained VLMs on 23 datasets.
arXiv Detail & Related papers (2023-06-15T06:53:05Z)
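As referenced in the Quality-Driven Curation entry above, the top-fraction selection idea behind score-based data curation is simple to sketch. Below is a minimal illustration, assuming a learned `score_fn` that rates each image-text pair; the 30% keep ratio mirrors that abstract, but the function names and data layout here are hypothetical and do not reflect the released implementation.

```python
# Minimal sketch of quality-score-based data curation: rank image-text pairs
# with a learned scorer and keep the top fraction for fine-tuning.
from typing import Callable, List, Tuple

def select_top_fraction(
    pairs: List[Tuple[str, str]],            # (image_path, caption) pairs
    score_fn: Callable[[str, str], float],   # learned quality score model (assumed interface)
    keep_ratio: float = 0.30,                # keep the top 30%, as described in the abstract
) -> List[Tuple[str, str]]:
    scored = sorted(pairs, key=lambda p: score_fn(*p), reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    return scored[:k]

# Usage with a hypothetical scorer; the curated subset then feeds a standard
# CLIP or VLM fine-tuning loop:
# curated = select_top_fraction(all_pairs, my_score_model.score, keep_ratio=0.30)
```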