Unifying Vision-Language Latents for Zero-label Image Caption Enhancement
- URL: http://arxiv.org/abs/2510.12931v1
- Date: Tue, 14 Oct 2025 19:12:59 GMT
- Title: Unifying Vision-Language Latents for Zero-label Image Caption Enhancement
- Authors: Sanghyun Byun, Jung Ick Guack, Mohanad Odema, Baisub Lee, Jacob Song, Woo Seong Chung
- Abstract summary: ViZer is an enhancement training framework that enables zero-label learning in image captioning. Unlike prior approaches that rely on human- or synthetically annotated datasets, ViZer actively aligns vision and language representation features during training. We demonstrate ViZer's advantage in qualitative evaluation, as automated caption metrics such as CIDEr and BERTScore often penalize details that are absent in reference captions.
- Score: 0.5274824616260646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) achieve remarkable performance through large-scale image-text pretraining. However, their reliance on labeled image datasets limits scalability and leaves vast amounts of unlabeled image data underutilized. To address this, we propose Unified Vision-Language Alignment for Zero-Label Enhancement (ViZer), an enhancement training framework that enables zero-label learning in image captioning, providing a practical starting point for broader zero-label adaptation in vision-language tasks. Unlike prior approaches that rely on human or synthetically annotated datasets, ViZer actively aligns vision and language representation features during training, enabling existing VLMs to generate improved captions without requiring text labels or full retraining. We demonstrate ViZer's advantage in qualitative evaluation, as automated caption metrics such as CIDEr and BERTScore often penalize details that are absent in reference captions. Applying ViZer on SmolVLM-Base and Qwen2-VL, we observe consistent qualitative improvements, producing captions that are more grounded and descriptive than their baseline.
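The abstract describes ViZer as aligning vision and language representation features during training, without text labels. The paper's actual objective is not given here, but a minimal sketch of what such latent alignment could look like, assuming a simple cosine-similarity loss between pooled vision-encoder features and language-embedding features (the function name and loss form are illustrative, not the authors' implementation):

```python
import numpy as np

def cosine_alignment_loss(vision_feats: np.ndarray, text_feats: np.ndarray) -> float:
    """Mean (1 - cosine similarity) between paired vision and language latents.

    vision_feats, text_feats: (batch, dim) arrays of pooled features.
    Minimizing this pulls each image's latent toward its caption's latent.
    """
    v = vision_feats / np.linalg.norm(vision_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(v * t, axis=-1)))
```

In a zero-label setting the "caption" side would come from the model's own generated text rather than ground-truth annotations, so no labeled data enters the loss.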
Related papers
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP)
We propose one straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning [22.93684323791136]
Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering.
We introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task data.
Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text generation-based tasks through ICCC instruction tuning.
arXiv Detail & Related papers (2024-04-01T04:28:01Z)
- Contrastive Vision-Language Alignment Makes Efficient Instruction Learner [31.281236193979165]
We study the task of extending the large language model (LLM) into a vision-language instruction-following model.
Existing methods typically train a visual adapter to align the representation between a pre-trained vision transformer (ViT) and the LLM by a generative image captioning loss.
We propose CG-VLM that applies Contrastive and Generative alignment objectives to effectively align the representation of ViT and LLM.
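CG-VLM's contrastive alignment objective is not detailed in this summary, but the standard CLIP-style symmetric contrastive loss it builds on can be sketched as follows (a generic InfoNCE implementation, not CG-VLM's code; the temperature value is illustrative):

```python
import numpy as np

def clip_contrastive_loss(img: np.ndarray, txt: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; the loss
    pushes each image toward its own caption and away from the others.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    labels = np.arange(len(img))

    def ce(l: np.ndarray) -> float:
        # Row-wise softmax cross-entropy with the diagonal as targets.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-logp[labels, labels].mean())

    return 0.5 * (ce(logits) + ce(logits.T))
```

In CG-VLM this contrastive term is reportedly combined with a generative captioning loss, so the vision features stay useful both for retrieval-style matching and for conditioning text generation.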
arXiv Detail & Related papers (2023-11-29T03:29:46Z)
- Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models [16.4010094165575]
We propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations.
Inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation.
Our method demonstrates advanced performance over state-of-the-art methods on various metrics.
arXiv Detail & Related papers (2023-11-05T01:14:02Z)
- Exploring Part-Informed Visual-Language Learning for Person Re-Identification [52.92511980835272]
We propose Part-Informed Visual-language Learning ($\pi$-VL) to enhance fine-grained visual features with part-informed language supervision for ReID tasks. $\pi$-VL introduces a human-parsing-guided prompt tuning strategy and a hierarchical visual-language alignment paradigm to ensure within-part feature semantic consistency. As a plug-and-play and inference-free solution, $\pi$-VL achieves performance comparable to or better than state-of-the-art methods on four commonly used ReID benchmarks.
arXiv Detail & Related papers (2023-08-04T23:13:49Z)
- Large-Scale Bidirectional Training for Zero-Shot Image Captioning [44.17587735943739]
We introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning.
We show that careful selection of large-scale training set and model architecture is the key to achieving zero-shot image captioning.
arXiv Detail & Related papers (2022-11-13T00:09:36Z)
- Prefix Conditioning Unifies Language and Label Supervision [84.11127588805138]
We show that dataset biases negatively affect pre-training by reducing the generalizability of learned representations.
In experiments, we show that this simple technique improves the performance in zero-shot image recognition accuracy and robustness to the image-level distribution shift.
arXiv Detail & Related papers (2022-06-02T16:12:26Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
In order to evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.