Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception
- URL: http://arxiv.org/abs/2504.06666v1
- Date: Wed, 09 Apr 2025 08:07:46 GMT
- Title: Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception
- Authors: Ruotian Peng, Haiying He, Yake Wei, Yandong Wen, Di Hu
- Abstract summary: High-quality image captions play a crucial role in improving the performance of cross-modal applications. Recent studies have employed multimodal large language models (MLLMs) to generate captions. However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations.
- Score: 10.377899615199278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature-Integration theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a divide-then-aggregate strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model's local perception of the image. These local details are then hierarchically aggregated to generate a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply a semantic-level filtering process during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed, reliable captions, advancing multimodal description generation without requiring model retraining. The source code is available at https://github.com/GeWu-Lab/Patch-Matters
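The abstract outlines a concrete pipeline (local patch captioning, hierarchical aggregation, semantic-level filtering), so a brief illustration may help. The sketch below is a minimal, hypothetical rendering of that divide-then-aggregate idea, not the authors' implementation (see the linked repository for that): the 2x2 spatial grid and the `mllm_caption`, `llm_aggregate`, and `llm_filter` callables are assumptions introduced here purely for illustration.

```python
# Minimal sketch of the divide-then-aggregate idea described in the abstract.
# The function names and the fixed 2x2 spatial split are illustrative
# assumptions, not the authors' actual API.
from PIL import Image


def spatial_patches(image: Image.Image, grid: int = 2):
    """Split the image into a grid x grid set of spatial patches."""
    w, h = image.size
    pw, ph = w // grid, h // grid
    for row in range(grid):
        for col in range(grid):
            yield image.crop((col * pw, row * ph, (col + 1) * pw, (row + 1) * ph))


def divide_then_aggregate(image, mllm_caption, llm_aggregate, llm_filter):
    """Training-free pipeline: caption local patches, then hierarchically
    aggregate them into one global description with semantic filtering."""
    # 1. Local perception: caption each spatial patch with the (frozen) MLLM.
    local_captions = [mllm_caption(patch) for patch in spatial_patches(image)]
    # 2. Global pass over the full image for overall context.
    global_caption = mllm_caption(image)
    # 3. Hierarchical aggregation: merge local details into the global caption.
    merged = llm_aggregate(global_caption, local_captions)
    # 4. Semantic-level filtering to drop hallucinated or inconsistent claims.
    return llm_filter(merged, local_captions)
```

Under this reading, the same frozen MLLM produces both the global caption and the per-patch captions, and a language model prompt handles aggregation and filtering; no weights are updated, which is what makes the pipeline training-free.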
Related papers
- Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens [66.02261367232256]
Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation.
Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order.
In this paper, we build a proper visual language by reconstructing diffusion timesteps to learn discrete visual tokens.
arXiv Detail & Related papers (2025-04-20T16:14:28Z)
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs). We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models [44.578308186225826]
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. We show that co-training an open-vocabulary detector with a large language model, which generates detailed image-level captions for each image, can further improve performance.
arXiv Detail & Related papers (2025-01-31T08:27:31Z)
- FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs [58.95386070800286]
FullAnno is a data engine that generates large-scale, high-quality, and fine-grained image annotations.
We re-annotated the COCO and Visual Genome datasets using our FullAnno system.
Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks.
arXiv Detail & Related papers (2024-09-20T14:33:17Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Linear Alignment of Vision-language Models for Image Captioning [8.921774238325566]
We propose a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods. We also propose two new learning-based image-captioning metrics built on CLIP score along with our proposed alignment.
arXiv Detail & Related papers (2023-07-10T17:59:21Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)