One More Glance with Sharp Eyes: Rethinking Lightweight Captioning as a Practical Visual Specialist
- URL: http://arxiv.org/abs/2508.21451v2
- Date: Mon, 13 Oct 2025 01:25:01 GMT
- Title: One More Glance with Sharp Eyes: Rethinking Lightweight Captioning as a Practical Visual Specialist
- Authors: Junha Song, Yongsik Jo, So Yeon Min, Quanting Xie, Taehwan Kim, Yonatan Bisk, Jaegul Choo,
- Abstract summary: We build lightweight captioning models using a 125M- parameter language model, 56 times smaller than LLaMA-7B.<n>We evaluate their performance on single-sentence but on detailed captioning tasks.<n>We develop a novel captioning framework, Sharp-Eyed Refinement, which enhances caption quality by refining coarse descriptions into more precise captions.
- Score: 58.89538703878721
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning is fundamental for applications like video-grounded chatbot systems and navigation robots, yet deploying such models on local devices is challenging due to the high computational demands of multimodal LLMs (MLLMs). To address this, we first build lightweight captioning models using a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluate their performance not only on single-sentence but on detailed captioning tasks. We obtain surprising results showing that our model can achieve performance comparable to MLLMs, suggesting its potential to serve as a strong captioning specialist for on-device applications. While promising, our model also exhibits a limitation: like other MLLMs, it suffers from occasional captioning errors. We investigate the underlying causes and observe that the problems stem from ineffective attention mechanisms and limited visual representations. To alleviate them, we develop a novel captioning framework, Sharp-Eyed Refinement, which enhances caption quality by refining coarse descriptions into more precise captions. At its core, DeepLens improves visual grounding by re-examining the informative regions identified in the initial glance. Experimental results demonstrate the superiority of our model over both recent lightweight captioning methods and MLLMs in detailed captioning and even in long-range video QA tasks.
Related papers
- Enrich and Detect: Video Temporal Grounding with Multimodal LLMs [60.224522472631776]
We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models.<n>Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video.<n>We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings.
arXiv Detail & Related papers (2025-10-19T22:12:45Z) - Demystifying the Visual Quality Paradox in Multimodal Large Language Models [49.154146792279946]
Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses.<n>We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks.<n>We uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity.
arXiv Detail & Related papers (2025-06-18T17:14:07Z) - Perceiving Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models [1.9253106218929117]
Multimodal Large Language Models (MLLMs) often fail to fully leverage visual input, defaulting to strong language priors.<n>Our approach first provides insights into how MLLMs internally build visual understanding of image regions and then introduces techniques to amplify this capability.<n>We demonstrate the superior multimodal understanding of our resultant model through a detailed upstream analysis quantifying its ability to predict visually-dependent tokens as well as 10 pt boost on visually challenging tasks.
arXiv Detail & Related papers (2025-05-08T20:04:27Z) - Multi-LLM Collaborative Caption Generation in Scientific Documents [30.856381292477177]
We introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP)<n>Our approach unfolds in three key modules.<n>Human evaluations demonstrate that informative captions produced by our approach rank better than human-written captions.
arXiv Detail & Related papers (2025-01-05T14:09:12Z) - CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models [16.91226496250909]
We break down multi-modal understanding into two stages, from Coarse to Fine.<n>In the first stage, we prompt the MLLM to locate the approximate area of the answer.<n>In the second stage, we further enhance the model's focus on relevant areas through visual prompt engineering.
arXiv Detail & Related papers (2024-12-22T05:42:40Z) - Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations.<n>We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions.<n>Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
arXiv Detail & Related papers (2024-12-20T01:37:22Z) - Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis [44.008094698200026]
This paper investigates whether Multimodal LLMs can supplant traditional image captioning networks by evaluating their performance on various image description benchmarks.<n>We explore both the zero-shot capabilities of these models and their adaptability to different semantic domains through fine-tuning methods.<n>Our results demonstrate that while Multimodal LLMs achieve impressive zero-shot performance, fine-tuning for specific domains while maintaining their generalization capabilities intact remains challenging.
arXiv Detail & Related papers (2024-12-04T19:01:06Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose bfAnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling [35.098725056881655]
Large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities.
The generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements.
We introduce a novel framework, ViGoR, that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines.
arXiv Detail & Related papers (2024-02-09T01:00:14Z) - Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings.
We identify ''CLIP-blind pairs'' - images that CLIP perceives as similar despite their clear visual differences.
We evaluate various CLIP-based vision-and-language models and found a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
arXiv Detail & Related papers (2024-01-11T18:58:36Z) - Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.<n> MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z) - Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [8.526212812623202]
State-of-The-Art (SoTA) image captioning models are often trained on the MicroSoft Common Objects in Context dataset.<n>We present a novel approach to generate richer and more informative image captions by combining the captions generated from different SoTA captioning models.
arXiv Detail & Related papers (2023-06-20T15:13:02Z) - A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based
Learning for Vision-Language Models [50.27305012063483]
FewVLM is a few-shot prompt-based learner on vision-language tasks.
We pretrain a sequence-to-sequence Transformer model with both prefix language modeling (PrefixLM) and masked language modeling (MaskedLM)
We observe that prompts significantly affect zero-shot performance but marginally affect few-shot performance.
arXiv Detail & Related papers (2021-10-16T06:07:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.