Self-Supervised Image Captioning with CLIP
- URL: http://arxiv.org/abs/2306.15111v2
- Date: Thu, 2 Nov 2023 17:57:54 GMT
- Title: Self-Supervised Image Captioning with CLIP
- Authors: Chuanyang Jin
- Abstract summary: We introduce a self-supervised image captioning method.
After learning an initial signal from a small labeled dataset, our method transitions to self-supervised learning on unlabeled data.
Despite utilizing less than 2% of the labeled COCO dataset, our method delivers a performance comparable to state-of-the-art models trained on the complete dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning, a fundamental task in vision-language understanding, seeks
to generate accurate natural language descriptions for provided images. Current
image captioning approaches heavily rely on high-quality image-caption pairs,
which can be hard to obtain for many domains. To address this, we introduce a
self-supervised image captioning method. After learning an initial signal from
a small labeled dataset, our method transitions to self-supervised learning on
unlabeled data, leveraging the auxiliary task of enhancing the CLIP relevance
between images and generated captions. Remarkably, despite utilizing less than
2% of the labeled COCO dataset, our method delivers a performance comparable to
state-of-the-art models trained on the complete dataset. Human evaluations
further reveal that our method produces captions with greater distinctiveness
and informativeness, two attributes inherently challenging to achieve through
supervised learning.
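The auxiliary task described above scores a generated caption by its CLIP relevance to the image, i.e. the cosine similarity of the two CLIP embeddings. A minimal, illustrative sketch of that reward computation (the random vectors stand in for real CLIP encoder outputs, and the function names are assumptions, not the paper's code):

```python
import numpy as np

def cosine_sim(a, b):
    # CLIP relevance: cosine similarity of L2-normalized embeddings
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(axis=-1)

def clip_reward(image_emb, caption_embs):
    # Score each candidate caption by its CLIP similarity to the image.
    # In the paper, a signal like this drives self-supervised training
    # on unlabeled images; here it just ranks toy candidates.
    return cosine_sim(image_emb[None, :], caption_embs)

# Toy stand-ins for CLIP encoder outputs (512-d, as in CLIP ViT-B/32)
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
caption_embs = rng.normal(size=(3, 512))

rewards = clip_reward(image_emb, caption_embs)
best = int(np.argmax(rewards))  # index of the most image-relevant caption
```

In practice the embeddings would come from a frozen CLIP image encoder and text encoder; the reward is then maximized (e.g. via reinforcement learning or a differentiable surrogate) while training the captioner on unlabeled images.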
Related papers
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions from unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models [16.4010094165575]
We propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations.
Inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation.
Our method outperforms the state of the art on various metrics.
arXiv Detail & Related papers (2023-11-05T01:14:02Z)
- Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension.
This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z)
- ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations [43.323791505213634]
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) is a solution for supplementing the training dataset with images without spurious features.
It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set.
It improves the worst-group classification accuracy of prior methods by 1% - 38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z)
- Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We present extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, together with a comprehensive analysis of the scarcely-paired setting.
arXiv Detail & Related papers (2023-01-26T15:25:43Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose CRIS, an end-to-end CLIP-driven Referring Image Segmentation framework.
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)
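The bag-of-visual-words idea in the last entry quantizes local convnet features against a codebook and trains a network to predict the resulting word histogram. A toy sketch of the target construction (codebook, shapes, and names are illustrative assumptions, not the paper's pipeline):

```python
import numpy as np

def bow_targets(features, codebook):
    # Assign each local descriptor to its nearest "visual word" in the
    # codebook, then build a normalized bag-of-words histogram that a
    # second network would be trained to predict from a perturbed image.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)  # nearest-codeword index per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()   # target distribution over visual words

rng = np.random.default_rng(0)
feats = rng.normal(size=(49, 64))     # e.g. a 7x7 conv feature map, 64-d
codebook = rng.normal(size=(16, 64))  # K=16 visual words (toy codebook)
target = bow_targets(feats, codebook)
```

In the actual method the codebook is built by clustering features of a pretrained network, and prediction of these histograms serves as the self-supervised objective.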
This list is automatically generated from the titles and abstracts of the papers in this site.