LMM-Det: Make Large Multimodal Models Excel in Object Detection
- URL: http://arxiv.org/abs/2507.18300v1
- Date: Thu, 24 Jul 2025 11:05:24 GMT
- Title: LMM-Det: Make Large Multimodal Models Excel in Object Detection
- Authors: Jincheng Li, Chunyu Xie, Ji Ao, Dawei Leng, Yuhui Yin
- Abstract summary: We propose LMM-Det, a simple yet effective approach that leverages a Large Multimodal Model for vanilla object Detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis of what happens when a large multimodal model meets object detection, revealing that the recall rate degrades significantly compared with specialist detection models. We claim that a large multimodal model possesses detection capability without any extra detection modules.
- Score: 0.62914438169038
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large multimodal models (LMMs) have garnered widespread attention and interest within the artificial intelligence research and industrial communities, owing to their remarkable capability in multimodal understanding, reasoning, and in-context learning, among others. While LMMs have demonstrated promising results in tackling multimodal tasks like image captioning, visual question answering, and visual grounding, the object detection capabilities of LMMs exhibit a significant gap compared to specialist detectors. To bridge the gap, we depart from the conventional methods of integrating heavy detectors with LMMs and propose LMM-Det, a simple yet effective approach that leverages a Large Multimodal Model for vanilla object Detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis of what happens when a large multimodal model meets object detection, revealing that the recall rate degrades significantly compared with specialist detection models. To mitigate this, we propose to increase the recall rate through data distribution adjustment and inference optimization tailored for object detection. We re-organize the instruction conversations to enhance the object detection capabilities of large multimodal models. We claim that a large multimodal model possesses detection capability without any extra detection modules. Extensive experiments support our claim and show the effectiveness of the versatile LMM-Det. The datasets, models, and code are available at https://github.com/360CVGroup/LMM-Det.
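As a rough illustration of the instruction re-organization idea described above, the sketch below rewrites COCO-style box annotations into single-turn detection conversations. The prompt template, field names, and coordinate format are assumptions for illustration only, not the exact LMM-Det data format (see the linked repository for the real one).

```python
# Minimal sketch: re-organizing COCO-style annotations into detection
# instruction conversations. The prompt template and field names are
# illustrative assumptions, not the exact LMM-Det data format.
import json

PROMPT = "Detect all objects in the image and report each as 'label: [x1, y1, x2, y2]'."

def to_conversation(image_id: str, annotations: list[dict]) -> dict:
    """Turn one image's box annotations into a single-turn conversation."""
    answer = "\n".join(
        f"{ann['category']}: {[round(v, 1) for v in ann['bbox_xyxy']]}"
        for ann in annotations
    )
    return {
        "image": image_id,
        "conversations": [
            {"from": "human", "value": f"<image>\n{PROMPT}"},
            {"from": "gpt", "value": answer or "No objects found."},
        ],
    }

if __name__ == "__main__":
    sample = to_conversation(
        "000000000139.jpg",
        [{"category": "person", "bbox_xyxy": [12.0, 40.5, 210.3, 388.9]},
         {"category": "dog", "bbox_xyxy": [220.1, 150.0, 400.7, 380.2]}],
    )
    print(json.dumps(sample, indent=2))
```

Supervising the answer turn to enumerate every box is one plausible way to target recall at training time, since the model is always trained to list all objects rather than only a salient subset.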
Related papers
- Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown [14.8657860984074]
Multimodal large language models (MLLMs) often fail in rarely encountered domain-specific tasks due to limited relevant knowledge. We construct a multimodal knowledge graph (MH-MMKG) which incorporates multi-modalities and intricate entity relations. We also design a series of challenging queries based on MH-MMKG to evaluate the models' ability for complex knowledge retrieval and reasoning.
arXiv Detail & Related papers (2025-06-21T05:01:02Z)
- A Simple Aerial Detection Baseline of Multimodal Language Models [33.91030170608569]
We present LMMRotate, a simple baseline that applies multimodal language models to aerial detection for the first time. We construct the baseline by fine-tuning open-source general-purpose models and achieve impressive detection performance comparable to conventional detectors.
arXiv Detail & Related papers (2025-01-16T18:09:22Z)
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing. It is designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
- Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features [79.45405711339322]
Generative Large Multimodal Models (LMMs) excel at a wide variety of vision-language (VL) tasks. Despite strong performance, LMMs' generative outputs are not specialized for vision-language classification tasks. We propose an approach that leverages multimodal feature extraction from the LMM's latent space.
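As a rough illustration of that idea (not necessarily the paper's exact method), the sketch below mean-pools an LMM's last-layer hidden states into per-example features and fits a small linear probe for few-shot classification; the pooling choice, shapes, and random stand-in tensors are all assumptions.

```python
# Minimal sketch: classify with features from an LMM's latent space instead
# of its generated text. `extract_features` and all shapes are placeholders.
import torch
import torch.nn as nn

def extract_features(hidden_states: torch.Tensor) -> torch.Tensor:
    """Mean-pool last-layer token states into one vector per example."""
    return hidden_states.mean(dim=1)  # (batch, seq, dim) -> (batch, dim)

# Random stand-ins for frozen-LMM hidden states of 16 few-shot examples.
feats = extract_features(torch.randn(16, 128, 4096))
labels = torch.randint(0, 4, (16,))  # 4 hypothetical classes

# Train a linear probe on top of the frozen features.
probe = nn.Linear(4096, 4)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(probe(feats), labels)
    loss.backward()
    opt.step()
```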
arXiv Detail & Related papers (2024-11-28T18:55:41Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- DMM: Disparity-guided Multispectral Mamba for Oriented Object Detection in Remote Sensing [8.530409994516619]
Multispectral oriented object detection faces challenges due to both inter-modal and intra-modal discrepancies.
We propose Disparity-guided Multispectral Mamba (DMM), a framework composed of a Disparity-guided Cross-modal Fusion Mamba (DCFM) module, a Multi-scale Target-aware Attention (MTA) module, and a Target-Prior Aware (TPA) auxiliary task.
arXiv Detail & Related papers (2024-07-11T02:09:59Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which map visual features to probability distributions over a Large Multi-modal Model's vocabulary.
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.
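A minimal sketch of that mapping as summarized above: score projected visual features against the model's word-embedding table and apply a softmax, yielding a per-patch probability distribution over the vocabulary. The toy sizes and random tensors below are illustrative assumptions.

```python
# Minimal sketch: map visual features to distributions over an LMM vocabulary.
import torch

vocab_embed = torch.randn(1000, 512)     # word-embedding table (V, D), toy sizes
visual_feats = torch.randn(1, 256, 512)  # projected patch features (B, N, D)

logits = visual_feats @ vocab_embed.T          # (B, N, V) similarity scores
visual_tokens = torch.softmax(logits, dim=-1)  # one distribution per patch
print(visual_tokens.shape, float(visual_tokens.sum(-1)[0, 0]))  # sums to ~1
```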
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- CaMML: Context-Aware Multimodal Learner for Large Models [16.30752006781618]
We introduce the Context-Aware MultiModal Learner (CaMML) for tuning large multimodal models (LMMs).
CaMML is crafted to seamlessly integrate multimodal contextual samples into large models, empowering the model to derive knowledge from analogous, domain-specific, up-to-date information.
Based on CaMML, we have developed two multimodal models, CaMML-7B and CaMML-13B, that have shown exceptional performance across an array of benchmark datasets for multimodal tasks.
arXiv Detail & Related papers (2024-01-06T07:54:58Z)
- Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection [86.24898024621008]
We present Myriad, a novel large multimodal model that applies vision experts for industrial anomaly detection (IAD). We utilize the anomaly map generated by the vision experts as guidance for LMMs, such that the vision model is guided to pay more attention to anomalous regions. Our proposed method not only performs favorably against state-of-the-art methods, but also inherits the flexibility and instruction-following ability of LMMs in the field of IAD.
arXiv Detail & Related papers (2023-10-29T16:49:45Z)
- An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models [116.50367506746713]
We present an empirical study of scaling LLaVA up to 33B and 65B/70B.
We find that scaling LMMs consistently enhances model performance and improves language capabilities.
We hope that this study makes state-of-the-art LMM research at a larger scale more accessible.
arXiv Detail & Related papers (2023-09-18T17:30:46Z)