Active Learning for Finely-Categorized Image-Text Retrieval by Selecting Hard Negative Unpaired Samples
- URL: http://arxiv.org/abs/2405.16301v1
- Date: Sat, 25 May 2024 16:50:33 GMT
- Title: Active Learning for Finely-Categorized Image-Text Retrieval by Selecting Hard Negative Unpaired Samples
- Authors: Dae Ung Jo, Kyuewang Lee, JaeHo Chung, Jin Young Choi
- Abstract summary: Securing a sufficient amount of paired data is important to train an image-text retrieval (ITR) model.
We propose an active learning algorithm for ITR that can collect paired data cost-efficiently.
We validate the effectiveness of the proposed method on Flickr30K and MS-COCO datasets.
- Score: 7.883521157895832
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Securing a sufficient amount of paired data is important for training an image-text retrieval (ITR) model, but collecting paired data is very expensive. To address this issue, in this paper we propose an active learning (AL) algorithm for ITR that can collect paired data cost-efficiently. Previous studies assume that image-text pairs are given and the annotator is asked for their category labels. In recent ITR studies, however, the importance of category labels has decreased, since a retrieval model can be trained with image-text pairs alone. For this reason, we set up an active learning scenario in which unpaired images (or texts) are given and the annotator provides corresponding texts (or images) to form paired data. The key idea of the proposed AL algorithm is to select unpaired images (or texts) that can serve as hard negative samples for the existing texts (or images). To this end, we introduce a novel scoring function for choosing hard negative samples. We validate the effectiveness of the proposed method on the Flickr30K and MS-COCO datasets.
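A minimal sketch of the selection idea above, assuming a CLIP-like joint embedding space is already available; the top-k similarity averaging used as the score here is an illustrative heuristic, not the paper's exact scoring function:

```python
import numpy as np

def hard_negative_score(candidate_embs, paired_text_embs, k=5):
    """Score unpaired candidate images by how close they sit to texts
    that already belong to OTHER pairs: a candidate whose nearest
    existing texts are very similar is a likely hard negative, so
    asking the annotator to caption it is most informative.

    candidate_embs:   (N, d) L2-normalized embeddings of unpaired images
    paired_text_embs: (M, d) L2-normalized embeddings of already-paired texts
    """
    sim = candidate_embs @ paired_text_embs.T        # (N, M) cosine similarities
    k = min(k, sim.shape[1])
    topk = np.sort(sim, axis=1)[:, -k:]              # k most similar texts
    return topk.mean(axis=1)                         # (N,) selection scores

def select_for_annotation(candidate_embs, paired_text_embs, budget):
    """Send the `budget` highest-scoring candidates to the annotator."""
    scores = hard_negative_score(candidate_embs, paired_text_embs)
    return np.argsort(scores)[::-1][:budget]
```

The annotator then writes captions for the selected images (or, symmetrically, provides images for the selected texts), turning the hardest unpaired samples into new training pairs.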
Related papers
- Active Mining Sample Pair Semantics for Image-text Matching [6.370886833310617]
This paper proposes a novel image-text matching model, called the Active Mining Sample Pair Semantics image-text matching model (AMSPS).
In contrast to the single-semantic learning mode of commonsense learning models trained with a triplet loss, AMSPS adopts an active learning approach (a minimal triplet-loss sketch follows this entry).
arXiv Detail & Related papers (2023-11-09T15:03:57Z)
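For reference, a minimal sketch of the triplet ranking loss mentioned above, with in-batch hardest-negative mining as commonly used for image-text matching (the margin value is illustrative):

```python
import torch
import torch.nn.functional as F

def triplet_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based triplet ranking loss over a batch of B matched
    image-text pairs, using the hardest in-batch negatives."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    sim = img_emb @ txt_emb.t()                      # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                    # matched-pair similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))       # exclude the positives
    # Hardest negative caption per image and hardest image per caption.
    cost_i2t = F.relu(margin + neg.max(dim=1).values.unsqueeze(1) - pos)
    cost_t2i = F.relu(margin + neg.max(dim=0).values.unsqueeze(1) - pos)
    return (cost_i2t + cost_t2i).mean()
```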
- Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency [47.3163261953469]
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities.
We introduce ITIT, an innovative training paradigm grounded in cycle consistency that allows vision-language training on unpaired image and text data.
ITIT comprises a joint image-text encoder with disjoint image and text decoders, enabling bidirectional image-to-text and text-to-image generation in a single framework (a sketch of the cycle loss follows this entry).
arXiv Detail & Related papers (2023-10-05T17:55:19Z)
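A hedged sketch of the cycle-consistency idea behind ITIT: caption an unpaired image, regenerate an image from that caption, and penalize the reconstruction gap. The two callables are hypothetical stand-ins for ITIT's disjoint decoders over a joint encoder, not its actual interface:

```python
import torch.nn.functional as F

def image_cycle_loss(image, image_to_text, text_to_image):
    """Cycle loss for an UNPAIRED image: image -> caption -> image.
    `image_to_text` and `text_to_image` are placeholder callables."""
    pseudo_caption = image_to_text(image)            # generate a caption
    reconstruction = text_to_image(pseudo_caption)   # regenerate the image
    # A small reconstruction gap keeps the two generation directions
    # consistent without ever needing a ground-truth paired caption.
    return F.mse_loss(reconstruction, image)
```

A symmetric text cycle (text -> image -> text, scored with a token-level cross-entropy) handles unpaired captions.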
- ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations [43.323791505213634]
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) is a solution for supplementing the training dataset with images without spurious features.
It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set.
It improves the worst-group classification accuracy of prior methods by 1% - 38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval [89.30660533051514]
Cross-modal retrieval methods are the preferred tool to search databases for the text that best matches a query image and vice versa.
Image-text retrieval models commonly learn spurious correlations in the training data, such as frequent object co-occurrence.
We introduce ODmAP@k, an object decorrelation metric that measures a model's robustness to spurious correlations in the training data (a sketch of the underlying mAP@k metric follows this entry).
arXiv Detail & Related papers (2023-04-06T21:45:46Z)
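The name suggests ODmAP@k builds on the standard mAP@k retrieval metric; below is a minimal sketch of plain mAP@k under one common convention (the object-decorrelation weighting that distinguishes ODmAP@k is not reproduced here):

```python
import numpy as np

def average_precision_at_k(ranked_relevance, k):
    """AP@k for one query. `ranked_relevance` is a 0/1 sequence over
    the retrieved list, ordered by model score (best first)."""
    rel = np.asarray(ranked_relevance[:k], dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_i * rel).sum() / rel.sum())

def map_at_k(all_ranked_relevance, k=10):
    """Mean of AP@k over all queries."""
    return float(np.mean([average_precision_at_k(r, k)
                          for r in all_ranked_relevance]))
```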
- Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
Extensive and comprehensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by comprehensive analysis on the scarcely-paired dataset, demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-01-26T15:25:43Z)
- Revising Image-Text Retrieval via Multi-Modal Entailment [25.988058843564335]
The many-to-many matching phenomenon is quite common in widely used image-text retrieval datasets.
We propose a multi-modal entailment classifier to determine whether a sentence is entailed by an image plus its linked captions.
arXiv Detail & Related papers (2022-08-22T07:58:54Z)
- ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval [51.588385824875886]
Cross-modal retrieval consists in finding images related to a given query text or vice-versa.
Many recent methods have proposed effective solutions to the image-text matching problem, mostly using large vision-language (VL) Transformer networks.
This paper proposes an ALign And DIstill Network (ALADIN) to fill the gap between effectiveness and efficiency (an illustrative distillation sketch follows this entry).
arXiv Detail & Related papers (2022-07-29T16:01:48Z)
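The align-and-distill idea can be sketched as training a cheap dot-product score to mimic the expensive fine-grained alignment scores of a VL Transformer teacher; the KL objective below is an illustrative stand-in, not ALADIN's exact loss:

```python
import torch.nn.functional as F

def distill_alignment_scores(img_global, txt_global, teacher_scores):
    """Distill a (B, B) fine-grained teacher alignment matrix into a
    fast student score given (B, d) global image/text embeddings."""
    student_scores = (F.normalize(img_global, dim=1)
                      @ F.normalize(txt_global, dim=1).t())
    # Match the teacher's in-batch ranking distribution row by row.
    return F.kl_div(F.log_softmax(student_scores, dim=1),
                    F.softmax(teacher_scores, dim=1),
                    reduction='batchmean')
```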
- Curriculum Learning for Data-Efficient Vision-Language Alignment [29.95935291982015]
Aligning image and text encoders from scratch using contrastive learning requires large amounts of paired image-text data.
We alleviate this need by aligning individually pre-trained language and vision representation models using a much smaller amount of paired data.
TOnICS outperforms CLIP on downstream zero-shot image retrieval while using less than 1% as much training data.
arXiv Detail & Related papers (2022-07-29T07:45:56Z)
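The contrastive alignment these last entries build on is the symmetric InfoNCE objective popularized by CLIP; a minimal sketch (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-text pairs:
    each image must identify its own caption among all batch texts,
    and vice versa."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temperature     # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```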
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.