The Devil is in the EOS: Sequence Training for Detailed Image Captioning
- URL: http://arxiv.org/abs/2507.20077v1
- Date: Sat, 26 Jul 2025 23:00:43 GMT
- Title: The Devil is in the EOS: Sequence Training for Detailed Image Captioning
- Authors: Abdelrahman Mohamed, Yova Kementchedjhieva
- Abstract summary: A bias towards the end-of-sequence (EOS) token is introduced during cross-entropy training. We propose an unsupervised method to debias the model's tendency to predict the EOS token prematurely. Our results show a substantial increase in caption length and relevant details, albeit with an expected increase in the rate of hallucinations.
- Score: 23.507574666352486
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite significant advances in vision-language models (VLMs), image captioning often suffers from a lack of detail, with base models producing short, generic captions. This limitation persists even though VLMs are equipped with strong vision and language backbones. While supervised data and complex reward functions have been proposed to improve detailed image captioning, we identify a simpler underlying issue: a bias towards the end-of-sequence (EOS) token, which is introduced during cross-entropy training. We propose an unsupervised method to debias the model's tendency to predict the EOS token prematurely. By reducing this bias, we encourage the generation of longer, more detailed captions without the need for intricate reward functions or supervision. Our approach is straightforward, effective, and easily applicable to any pretrained model. We demonstrate its effectiveness through experiments with three VLMs and on three detailed captioning benchmarks. Our results show a substantial increase in caption length and relevant details, albeit with an expected increase in the rate of hallucinations.
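The abstract leaves the debiasing procedure unspecified here. As a minimal illustrative sketch only, substituting a decode-time logit penalty for whatever training-time adjustment the paper actually uses, one could down-weight the EOS logit with a HuggingFace-style logits processor; the class name and `penalty` value below are hypothetical:

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class EOSDebiasLogitsProcessor(LogitsProcessor):
    """Subtract a fixed penalty from the EOS logit at every decoding step.

    Illustrative decode-time stand-in for the paper's training-time
    debiasing; `penalty` is a hypothetical knob, not a value from the
    paper. Larger values delay EOS and lengthen captions.
    """

    def __init__(self, eos_token_id: int, penalty: float = 2.0):
        self.eos_token_id = eos_token_id
        self.penalty = penalty

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        # Lower the EOS score so premature stopping becomes less likely.
        scores[:, self.eos_token_id] = scores[:, self.eos_token_id] - self.penalty
        return scores

# Usage with any pretrained VLM decoder that supports `generate`:
# out = model.generate(
#     **inputs,
#     logits_processor=LogitsProcessorList(
#         [EOSDebiasLogitsProcessor(tokenizer.eos_token_id)]
#     ),
# )
```

Consistent with the trade-off the abstract reports, pushing the EOS logit down lengthens captions but can also raise the hallucination rate.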
Related papers
- Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning [56.31096024472269]
We introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units. DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models.
arXiv Detail & Related papers (2025-03-10T22:53:56Z)
- LoTLIP: Improving Language-Image Pre-training for Long Text Understanding [71.04947115945349]
We relabel the data with long captions; however, training directly on them can degrade performance in understanding short text.
We then restore the model's original level of short-text understanding while greatly enhancing its capability for long-text understanding.
Our method demonstrates superior performance in long-text-image retrieval tasks.
arXiv Detail & Related papers (2024-10-07T17:52:56Z)
- CI w/o TN: Context Injection without Task Name for Procedure Planning [4.004155037293416]
Procedure planning in instructional videos involves creating goal-directed plans based on visual start and goal observations from videos.
Previous research has tackled this problem with gradually weaker training supervision, from heavy intermediate visual observations or language instructions to task class supervision.
We propose a much weaker setting, with no task name as supervision, which existing large language models cannot currently solve.
arXiv Detail & Related papers (2024-02-23T19:34:47Z)
- Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection [17.182722268446604]
We propose a probing-based captioning approach to leverage PVLMs in a zero-shot visual question answering (VQA) manner.
Specifically, we prompt a frozen PVLM by asking hateful content-related questions and use the answers as image captions.
The strong performance of Pro-Cap-based models on three benchmarks validates the effectiveness and generalization of the proposed method.
arXiv Detail & Related papers (2023-08-16T01:38:49Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-the-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- The EOS Decision and Length Extrapolation [103.7271774593922]
Extrapolation to unseen sequence lengths is a challenge for neural generative models of language.
We study an oracle setting to compare the length-extrapolative behavior of networks trained to predict EOS (+EOS) with networks not trained to do so (-EOS); a minimal sketch of this masking appears after this list.
We find that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen at training time in a bracket-closing task.
arXiv Detail & Related papers (2020-10-14T15:46:17Z)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z)
- Length-Controllable Image Captioning [67.2079793803317]
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We propose a simple length-level embedding to endow image captioning models with control over caption length (see the embedding sketch after this list).
We further devise a non-autoregressive image captioning approach that can generate captions with length-independent complexity.
arXiv Detail & Related papers (2020-07-19T03:40:51Z)
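The -EOS training condition from "The EOS Decision and Length Extrapolation" above is not fully specified here; a minimal sketch, assuming teacher-forced decoder logits and a standard ignore-index cross-entropy, simply masks EOS positions out of the loss so the model is never trained to predict EOS:

```python
import torch
import torch.nn.functional as F

def caption_loss_without_eos(logits: torch.Tensor,
                             targets: torch.Tensor,
                             eos_token_id: int) -> torch.Tensor:
    """Token-level cross-entropy that ignores EOS positions (-EOS setting).

    logits:  (batch, seq_len, vocab_size) teacher-forced decoder outputs
    targets: (batch, seq_len) gold next-token ids
    """
    masked_targets = targets.clone()
    # Replace EOS targets with PyTorch's default ignore_index so they
    # contribute nothing to the loss.
    masked_targets[targets == eos_token_id] = -100
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        masked_targets.reshape(-1),
        ignore_index=-100,
    )
```

In such a setup, decoding must rely on an external stopping rule (e.g. a maximum length) rather than a learned EOS decision.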
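Likewise, the length-level embedding from "Length-Controllable Image Captioning" can be sketched as a learned embedding per coarse length bucket added to every token embedding; the bucket boundaries, module, and names below are hypothetical, not taken from the paper:

```python
import torch
import torch.nn as nn

class LengthAwareEmbedding(nn.Module):
    """Token embedding augmented with a learned length-level embedding.

    Captions are bucketed into coarse length levels (hypothetically,
    e.g. 0: short, 1: medium, 2: long, 3: very long), and the level's
    embedding is added to every token embedding so the decoder can
    condition on the desired caption length.
    """

    def __init__(self, vocab_size: int, d_model: int, num_levels: int = 4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.len_level = nn.Embedding(num_levels, d_model)

    def forward(self, token_ids: torch.Tensor,
                level_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); level_ids: (batch,)
        return self.tok(token_ids) + self.len_level(level_ids).unsqueeze(1)
```

At inference time, the desired length level is simply supplied as an input, which is what makes the caption length controllable.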
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.