LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
- URL: http://arxiv.org/abs/2501.18954v1
- Date: Fri, 31 Jan 2025 08:27:31 GMT
- Title: LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
- Authors: Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng
- Abstract summary: Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data.
We show that co-training an open-vocabulary detector with a large language model that generates detailed image-level captions for each image can further improve performance.
- Score: 44.578308186225826
- Abstract: Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model that generates detailed image-level captions for each image can further improve performance. To achieve this goal, we first collect a dataset, GroundingCap-1M, in which each image is accompanied by associated grounding labels and a detailed image-level caption. With this dataset, we finetune an open-vocabulary detector with training objectives that include a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both short region-level captions for each region of interest and long image-level captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin and enjoys superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset are available at https://github.com/iSEE-Laboratory/LLMDet.
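As a rough illustration of the co-training objective described in the abstract, the sketch below combines a detector's grounding loss with a caption-generation loss computed against LLM-written region-level and image-level captions. This is a minimal, hypothetical sketch based only on the abstract, not the authors' implementation; all names (CoTrainingLoss, caption_weight, etc.) are placeholders.
```python
# Hypothetical sketch of an LLMDet-style co-training objective:
# standard grounding loss + caption-generation loss supervised by
# LLM-generated captions. Not the authors' actual code.
import torch
import torch.nn as nn


class CoTrainingLoss(nn.Module):
    def __init__(self, caption_weight: float = 1.0):
        super().__init__()
        self.caption_weight = caption_weight
        # Token-level cross-entropy for caption generation; -100 marks padding tokens.
        self.caption_ce = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, grounding_loss, caption_logits, caption_targets):
        """grounding_loss: the detector's usual alignment/box losses (scalar tensor).
        caption_logits:  (batch, seq_len, vocab) predictions for the LLM-generated
                         region-level and image-level captions.
        caption_targets: (batch, seq_len) token ids of those captions.
        """
        caption_loss = self.caption_ce(
            caption_logits.flatten(0, 1),  # (batch * seq_len, vocab)
            caption_targets.flatten(),     # (batch * seq_len,)
        )
        return grounding_loss + self.caption_weight * caption_loss


# Usage with dummy tensors:
loss_fn = CoTrainingLoss(caption_weight=1.0)
dummy_grounding = torch.tensor(0.42)
dummy_logits = torch.randn(2, 16, 32000)
dummy_targets = torch.randint(0, 32000, (2, 16))
total_loss = loss_fn(dummy_grounding, dummy_logits, dummy_targets)
```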
Related papers
- FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings.
Our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z)
- BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions [118.35194230865451]
We introduce BLIP3-KALE, a dataset of 218 million image-text pairs.
KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions.
We train vision-language models on KALE and demonstrate improvements on vision-language tasks.
arXiv Detail & Related papers (2024-11-12T00:52:52Z)
- Hyperbolic Learning with Synthetic Captions for Open-World Detection [26.77840603264043]
We propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically.
Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions on different regions in images.
We also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings.
arXiv Detail & Related papers (2024-04-07T17:06:22Z)
- Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models contribute most to encoding global image information.
In this study, we find that the intermediate layers of models can encode more global semantic information.
We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z)
- Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models [16.4010094165575]
We propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations.
Inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation.
Our method outperforms the state of the art across various metrics.
arXiv Detail & Related papers (2023-11-05T01:14:02Z)
- LAION-5B: An open large-scale dataset for training next generation image-text models [16.129935376579326]
We present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English text.
We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset.
We also provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation.
arXiv Detail & Related papers (2022-10-16T00:08:18Z)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [86.4572981982407]
We propose BLIP, a new vision-language framework which transfers flexibly to both vision-language understanding and generation tasks.
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones.
BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
arXiv Detail & Related papers (2022-01-28T12:49:48Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.