Cross-Modal Similarity-Based Curriculum Learning for Image Captioning
- URL: http://arxiv.org/abs/2212.07075v1
- Date: Wed, 14 Dec 2022 07:52:36 GMT
- Title: Cross-Modal Similarity-Based Curriculum Learning for Image Captioning
- Authors: Hongkuan Zhang, Saku Sugawara, Akiko Aizawa, Lei Zhou, Ryohei Sasano,
Koichi Takeda
- Abstract summary: We propose a simple yet efficient difficulty measurement for image captioning using cross-modal similarity calculated by a pretrained vision-language model.
Experiments on the COCO and Flickr30k datasets show that our proposed approach achieves superior performance and a convergence speed competitive with baselines.
- Score: 46.18855398491187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning models require a high level of generalization ability
to describe the contents of various images in words. Most existing approaches
treat the image-caption pairs equally in their training without considering the
differences in their learning difficulties. Several image captioning approaches
introduce curriculum learning methods that present training data with
increasing levels of difficulty. However, their difficulty measurements are
either based on domain-specific features or prior model training. In this
paper, we propose a simple yet efficient difficulty measurement for image
captioning using cross-modal similarity calculated by a pretrained
vision-language model. Experiments on the COCO and Flickr30k datasets show that
our proposed approach achieves superior performance and a convergence speed
competitive with the baselines, without requiring heuristics or incurring
additional training costs. Moreover, the higher performance on difficult
examples and unseen data further demonstrates the model's generalization ability.
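As a concrete illustration (not the authors' released code), the sketch below scores
each image-caption pair with an off-the-shelf vision-language model and orders the
training set from easy (well-aligned) to hard (poorly aligned) pairs. The specific
choices here are assumptions: CLIP ("openai/clip-vit-base-patch32" via Hugging Face
transformers) as the pretrained model, cosine similarity as the alignment score, and
1 - similarity as the difficulty; the abstract only states that difficulty is derived
from cross-modal similarity computed by a pretrained vision-language model.
```python
# Hedged sketch: CLIP-based difficulty scoring for curriculum ordering.
# The checkpoint, the cosine-similarity score, and the "1 - similarity"
# difficulty definition are illustrative assumptions, not the paper's exact setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def difficulty_scores(image_paths, captions, batch_size=32):
    """Return one difficulty score per (image, caption) pair.

    Difficulty is taken as 1 - cosine similarity between the CLIP image and
    text embeddings, so poorly aligned pairs count as harder examples.
    """
    scores = []
    for start in range(0, len(image_paths), batch_size):
        images = [Image.open(p).convert("RGB")
                  for p in image_paths[start:start + batch_size]]
        texts = captions[start:start + batch_size]
        inputs = processor(text=texts, images=images, return_tensors="pt",
                           padding=True, truncation=True)
        with torch.no_grad():
            img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
            txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                              attention_mask=inputs["attention_mask"])
        # Cosine similarity between L2-normalized embeddings, one value per pair.
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        sim = (img_emb * txt_emb).sum(dim=-1)
        scores.extend((1.0 - sim).tolist())  # low similarity -> high difficulty
    return scores


# Usage: compute the scores once, then present training pairs to the captioning
# model in easy-to-hard order (for example with a competence-based pacing schedule).
# order = sorted(range(len(captions)), key=lambda i: scores[i])
```
Because the vision-language model is kept frozen and the scores are computed once
before captioner training, this ordering adds no extra training cost, which is the
property the abstract highlights.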
Related papers
- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data [40.88256210436378]
We present a novel weakly supervised pre-training of vision models on web-scale image-text data.
The proposed method reframes pre-training on image-text data as a classification task.
It achieves a remarkable $2.7\times$ acceleration in training speed compared to contrastive learning on web-scale data.
arXiv Detail & Related papers (2024-04-24T05:13:28Z)
- ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation [36.43428388918294]
Web-scale training on paired text-image data is becoming increasingly central to multimodal learning.
Standard data filtering approaches fail to remove mismatched text-image pairs.
We propose a new metric, image caption concreteness, which evaluates how concrete a caption is without requiring a reference image.
arXiv Detail & Related papers (2024-03-02T20:36:10Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- Image Captioners Are Scalable Vision Learners Too [61.98796478791261]
Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones.
Our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.
arXiv Detail & Related papers (2023-06-13T17:18:01Z)
- Multimodal Data Augmentation for Image Captioning using Diffusion Models [12.221685807426264]
We propose a data augmentation method, leveraging a text-to-image model called Stable Diffusion, to expand the training set.
Experiments on the MS COCO dataset demonstrate the advantages of our approach over several benchmark methods.
Further improvements in training efficiency and effectiveness can be obtained by intentionally filtering the generated data.
arXiv Detail & Related papers (2023-05-03T01:57:33Z)
- Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We report extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis on the scarcely paired dataset.
arXiv Detail & Related papers (2023-01-26T15:25:43Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of their powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)