Exploring Annotation-free Image Captioning with Retrieval-augmented
Pseudo Sentence Generation
- URL: http://arxiv.org/abs/2307.14750v2
- Date: Fri, 28 Jul 2023 05:53:33 GMT
- Title: Exploring Annotation-free Image Captioning with Retrieval-augmented
Pseudo Sentence Generation
- Authors: Zhiyuan Li and Dongnan Liu and Heng Wang and Chaoyi Zhang and Weidong
Cai
- Abstract summary: We introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG) to train captioners without annotated image-sentence pairs.
RaPSG retrieves relevant short region descriptions from mismatching corpora and uses them to generate a variety of pseudo sentences with distinct representations.
We show that our method surpasses the SOTA pre-training model (Flamingo3B) by achieving a CIDEr score of 78.1 (+5.1) while utilizing only 0.3% of its trainable parameters.
- Score: 23.54149252498897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training an image captioner without annotated image-sentence pairs has gained
traction in recent years. Previous approaches can be categorized into two
strategies: crawling sentences from mismatching corpora and aligning them with
the given images as pseudo annotations, or pre-training the captioner using
external image-text pairs. However, the alignment strategy appears to have reached
its performance limit because of the low quality of the pseudo pairs, and pre-training
requires significant computational resources. To address these challenges, we
propose a new strategy, "LPM + retrieval-augmented learning", where the prior
knowledge from large pre-trained models (LPMs) is leveraged as supervision, and
a retrieval process is integrated to further reinforce its effectiveness.
Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation
(RaPSG), which adopts an efficient approach to retrieve highly relevant short
region descriptions from the mismatching corpora and use them to generate a
variety of pseudo sentences with distinct representations as well as high
quality via LPMs. In addition, a fluency filter and a CLIP-guided training
objective are further introduced to facilitate model optimization. Experimental
results demonstrate that our method surpasses the SOTA pre-training model
(Flamingo3B) by achieving a CIDEr score of 78.1 (+5.1) while utilizing only
0.3% of its trainable parameters (1.3B vs. 33M). Importantly, our approach
eliminates the need for computationally expensive pre-training processes on
external datasets (e.g., the requirement of 312M image-text pairs for
Flamingo3B). We further show that with a simple extension, the generated pseudo
sentences can be deployed as weak supervision to boost the 1% semi-supervised
image captioning benchmark to a CIDEr score of 93.4 (+8.9), which showcases the
versatility and effectiveness of our approach.
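The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that the image and each short region description have already been embedded into a shared vector space (for example by a CLIP-like encoder) and that relevance is measured by cosine similarity. The function names, the toy 3-dimensional vectors, and the top-k selection rule are hypothetical choices for illustration; the abstract does not specify the retrieval procedure RaPSG actually uses.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    image_vec: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k descriptions most similar to the image embedding."""
    scored = [(text, cosine(image_vec, vec)) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy, hand-made embeddings (assumption: real ones would come from an encoder).
image_vec = [0.9, 0.1, 0.0]
corpus = [
    ("a brown dog", [1.0, 0.0, 0.0]),
    ("green grass field", [0.6, 0.8, 0.0]),
    ("a parked car", [0.0, 0.0, 1.0]),
]

top = retrieve_top_k(image_vec, corpus, k=2)
for text, score in top:
    print(f"{score:.3f}  {text}")
```

In a pipeline of the kind the abstract outlines, the retrieved descriptions would then be passed to a large pre-trained model to compose pseudo sentences. That stage is not sketched here because the abstract gives no detail on it.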
Related papers
- Pseudo-triplet Guided Few-shot Composed Image Retrieval [20.130745490934597]
Composed Image Retrieval (CIR) is a challenging task that aims to retrieve the target image based on a multimodal query.
We propose a novel two-stage pseudo triplet guided few-shot CIR scheme, dubbed PTG-FSCIR.
Our scheme is plug-and-play and compatible with any existing supervised CIR models.
arXiv Detail & Related papers (2024-07-08T14:53:07Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking [34.31345844296072]
Composed image retrieval attempts to retrieve an image of interest from gallery images through a composed query of a reference image and its corresponding modified text.
Most current composed image retrieval methods follow a supervised learning approach to training on a costly triplet dataset composed of a reference image, modified text, and a corresponding target image.
We present a new training-free zero-shot composed image retrieval method which translates the query into explicit human-understandable text.
arXiv Detail & Related papers (2023-12-14T13:31:01Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose a vision token pruning and merging method, ELIP, to remove less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with the removal of 30% vision tokens across 12 ViT layers, ELIP maintains significantly comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
- ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations [43.323791505213634]
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) is a solution for supplementing the training dataset with images without spurious features.
It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set.
It improves the worst-group classification accuracy of prior methods by 1% - 38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z)
- UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks [60.46473247205654]
Using large-scale unsupervised unimodal models as pre-training can enhance the zero-shot performance of image-text pair models.
Our experiments show that unimodal pre-training outperforms state-of-the-art CLIP-based models.
arXiv Detail & Related papers (2023-06-07T18:26:22Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme for training the UIC model that makes the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision [32.4697157553247]
In this paper, we focus on teasing out what parts of the language supervision are essential for training zero-shot image classification models.
A simple Bag-of-Words (BoW) caption could be used as a replacement for most of the image captions in the dataset.
Using a BoW pretrained model, we can obtain more training data by generating pseudo-BoW captions on images that do not have a caption.
arXiv Detail & Related papers (2021-12-27T20:02:10Z)