Noise-aware Learning from Web-crawled Image-Text Data for Image
Captioning
- URL: http://arxiv.org/abs/2212.13563v2
- Date: Wed, 27 Sep 2023 07:26:12 GMT
- Title: Noise-aware Learning from Web-crawled Image-Text Data for Image
Captioning
- Authors: Wooyoung Kang, Jonghwan Mun, Sungjun Lee, Byungseok Roh
- Abstract summary: The Noise-aware Captioning (NoC) framework learns rich knowledge from the whole web-crawled data while being less affected by noise.
This is achieved by the proposed alignment-level-controllable captioner, which is learned using alignment levels of the image-text pairs as a control signal.
An in-depth analysis shows the effectiveness of our framework in handling noise.
- Score: 6.101765622702223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning is a task that can directly take advantage of large-scale
web-crawled data, which provides rich knowledge about the visual world for a
captioning model. However, since web-crawled data contains image-text pairs
aligned at different levels, the inherent noise (e.g., misaligned pairs) makes
it difficult to learn a precise captioning model. While a filtering strategy
can effectively remove noisy data, it reduces the learnable knowledge and can
introduce a new problem of data deficiency. To get the best of both worlds, we propose a Noise-aware
Captioning (NoC) framework, which learns rich knowledge from the whole
web-crawled data while being less affected by noise. This is achieved by
the proposed alignment-level-controllable captioner, which is learned using
alignment levels of the image-text pairs as a control signal during training.
The alignment-level-conditioned training allows the model to generate
high-quality captions by simply setting the control signal to the desired
alignment level at inference time. An in-depth analysis shows the effectiveness
of our framework in handling noise. On two tasks, zero-shot captioning and
text-to-image retrieval with generated captions (i.e., self-retrieval), we
also demonstrate that our model produces high-quality captions in terms of
descriptiveness and distinctiveness. The code is available at
\url{https://github.com/kakaobrain/noc}.
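The core mechanism is to condition the captioner on a discrete alignment level for each image-text pair during training, and then to fix the control signal to the highest level at inference so the model generates well-aligned captions. Below is a minimal sketch of that idea, assuming alignment levels are obtained by bucketing an image-text similarity score (e.g., from a CLIP-style dual encoder) and injected as a reserved control token prepended to the caption; the bucket thresholds, token scheme, and number of levels are illustrative assumptions, not the paper's implementation.
```python
import torch
import torch.nn.functional as F

NUM_LEVELS = 4  # number of discrete alignment levels (assumed value)

def alignment_level(image_emb: torch.Tensor, text_emb: torch.Tensor) -> int:
    """Map the cosine similarity of one image-text pair to a discrete alignment level."""
    sim = F.cosine_similarity(image_emb, text_emb, dim=-1).item()
    # Evenly spaced buckets over [-1, 1]; real thresholds would be tuned on data.
    bucket = int((sim + 1.0) / 2.0 * NUM_LEVELS)
    return min(bucket, NUM_LEVELS - 1)

def training_sequence(level: int, caption_ids: list, level_token_offset: int) -> list:
    """Prepend a reserved control token encoding the alignment level of this pair."""
    return [level_token_offset + level] + caption_ids

def inference_prefix(level_token_offset: int) -> list:
    """At inference, condition generation on the highest alignment level."""
    return [level_token_offset + (NUM_LEVELS - 1)]
```
In this scheme the model is still trained on the whole web-crawled corpus, so noisy pairs continue to contribute visual knowledge, while the control token lets the decoder disentangle how aligned a caption should be from what the image contains.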
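The self-retrieval evaluation mentioned in the abstract uses each generated caption as a text query over the set of source images and checks whether the caption retrieves its own image. Below is a minimal sketch under the assumption that a pretrained dual encoder (e.g., CLIP) embeds images and captions and recall@k is the metric; the choice of encoder and k is an assumption, not taken from the paper.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def self_retrieval_recall(image_embs: torch.Tensor, caption_embs: torch.Tensor, k: int = 1) -> float:
    """Fraction of generated captions whose source image appears in the top-k retrieved images.

    image_embs:   (N, D) embeddings of the N source images
    caption_embs: (N, D) embeddings of the captions generated for those images (row i <-> image i)
    """
    image_embs = F.normalize(image_embs, dim=-1)
    caption_embs = F.normalize(caption_embs, dim=-1)
    sims = caption_embs @ image_embs.t()          # (N, N): caption i vs. every image
    topk = sims.topk(k, dim=-1).indices           # indices of the top-k images per caption
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```
Higher recall indicates more distinctive captions: a generic caption matches many images and fails to single out the one it was generated from.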
Related papers
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data [18.479220305684837]
Recent advances in image captioning allow us to pre-train high-quality video models without parallel video-text data.
We show that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions.
arXiv Detail & Related papers (2023-04-04T19:11:05Z)
- Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those from standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We present extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis of the scarcely-paired dataset.
arXiv Detail & Related papers (2023-01-26T15:25:43Z)
- NLIP: Noise-robust Language-Image Pre-training [95.13287735264937]
We propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion.
Our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way.
arXiv Detail & Related papers (2022-12-14T08:19:30Z)
- Large-Scale Bidirectional Training for Zero-Shot Image Captioning [44.17587735943739]
We introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning.
We show that careful selection of the large-scale training set and model architecture is key to achieving zero-shot image captioning.
arXiv Detail & Related papers (2022-11-13T00:09:36Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.