EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing
- URL: http://arxiv.org/abs/2503.23330v1
- Date: Sun, 30 Mar 2025 06:13:13 GMT
- Title: EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing
- Authors: Hongxiang Jiang, Jihao Yin, Qixiong Wang, Jiaqi Feng, Guo Chen,
- Abstract summary: EagleVision is an MLLM tailored for remote sensing that excels in object detection and attribute comprehension.<n>We construct EVAttrs-95K, the first large-scale object attribute understanding dataset in RS for instruction tuning.<n>EagleVision achieves state-of-the-art performance on both fine-grained object detection and object attribute understanding tasks.
- Score: 3.3072144045024396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated impressive results in various visual tasks. However, in remote sensing (RS), high resolution and small proportion of objects pose challenges to existing MLLMs, which struggle with object-centric tasks, particularly in precise localization and fine-grained attribute description for each object. These RS MLLMs have not yet surpassed classical visual perception models, as they only provide coarse image understanding, leading to limited gains in real-world scenarios. To address this gap, we establish EagleVision, an MLLM tailored for remote sensing that excels in object detection and attribute comprehension. Equipped with the Attribute Disentangle module, EagleVision learns disentanglement vision tokens to express distinct attributes. To support object-level visual-language alignment, we construct EVAttrs-95K, the first large-scale object attribute understanding dataset in RS for instruction tuning, along with a novel evaluation benchmark, EVBench. EagleVision achieves state-of-the-art performance on both fine-grained object detection and object attribute understanding tasks, highlighting the mutual promotion between detection and understanding capabilities in MLLMs. The code, model, data, and demo will be available at https://github.com/XiangTodayEatsWhat/EagleVision.
Related papers
- SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories [52.57696897619189]
We introduce the Human-Like Mask Modeling Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools.
HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens.
HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task.
arXiv Detail & Related papers (2025-03-11T17:08:54Z) - Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method [10.748210940033484]
Large language models (LLMs) and vision-language models (VLMs) have achieved significant success.<n>Due to the substantial differences between remote sensing images and conventional optical images, these models face challenges in comprehension.<n>This letter explores the application of VLMs for object detection in remote sensing images.
arXiv Detail & Related papers (2025-03-11T08:02:54Z) - Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models [31.34575955517015]
Finedefics is an MLLM that enhances the model's FGVR capability by incorporating informative attribute descriptions of objects into the training phase.<n>We employ contrastive learning on object-attribute pairs and attribute-category pairs simultaneously and use examples from similar but incorrect categories as hard negatives.<n> Extensive evaluations across multiple popular FGVR datasets demonstrate that Finedefics outperforms existing MLLMs of comparable parameter sizes.
arXiv Detail & Related papers (2025-01-25T08:52:43Z) - Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories.
Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance.
We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z) - Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [69.35226485836641]
Excessive use of visual tokens in existing Multimoal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation.
We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE)
DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z) - Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z) - MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning [44.497776004372724]
Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks.
We present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow.
To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors.
arXiv Detail & Related papers (2024-06-25T17:55:11Z) - VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large model (MLLM)<n>It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z) - Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.<n>This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok)<n>SeTok groups visual features into semantic units via a dynamic clustering algorithm.<n>The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z) - Proximity QA: Unleashing the Power of Multi-Modal Large Language Models
for Spatial Proximity Analysis [45.62657605766754]
Multi-modal large language models (MLLMs) have demonstrated remarkable vision-language capabilities.
Proximity QA is a novel framework designed to enable MLLMs to infer the proximity relationship between objects in images.
We have conducted extensive experiments to validate Proximity QA's superior ability in depth perception and proximity analysis.
arXiv Detail & Related papers (2024-01-31T14:21:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.