Denoising Large-Scale Image Captioning from Alt-text Data using Content
Selection Models
- URL: http://arxiv.org/abs/2009.05175v2
- Date: Fri, 16 Apr 2021 23:11:48 GMT
- Title: Denoising Large-Scale Image Captioning from Alt-text Data using Content
Selection Models
- Authors: Khyathi Raghavi Chandu, Piyush Sharma, Soravit Changpinyo, Ashish
Thapliyal, Radu Soricut
- Abstract summary: We show that selecting content words as skeletons helps in generating improved and denoised captions.
We also show that the predicted English skeletons can be further cross-lingually leveraged to generate non-English captions.
We also show that skeleton-based prediction allows for better control of certain caption properties, such as length, content, and gender expression.
- Score: 25.86785379429413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large-scale image captioning (IC) models demands access to a rich
and diverse set of training examples, gathered from the wild, often from noisy
alt-text data. However, recent modeling approaches to IC often fall short in
terms of performance in this case, because they assume a clean annotated
dataset (as opposed to the noisier alt-text-based annotations), and employ an
end-to-end generation approach, which often lacks both controllability and
interpretability. We address these problems by breaking down the task into two
simpler, more controllable tasks: skeleton prediction and skeleton-based
caption generation. Specifically, we show that selecting content words as
skeletons helps in generating improved and denoised captions when leveraging
rich yet noisy alt-text-based uncurated datasets. We also show that the
predicted English skeletons can be further cross-lingually leveraged to
generate non-English captions, and present experimental results covering
caption generation in French, Italian, German, Spanish and Hindi. We also show
that skeleton-based prediction allows for better control of certain caption
properties, such as length, content, and gender expression, providing a handle
to perform human-in-the-loop semi-automatic corrections.
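As a rough, hypothetical sketch of the two-stage idea described in the abstract (skeleton prediction followed by skeleton-conditioned generation), the Python below selects content words from a noisy alt-text string with a placeholder relevance scorer and hands them to a stub generator. It is not the authors' model; the scorer, stopword list, and output format are assumptions made only for illustration.

```python
from typing import Callable, List

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "for", "with", "to", "at"}

def predict_skeleton(alt_text: str, score: Callable[[str], float], keep: int = 4) -> List[str]:
    """Stage 1: rank candidate content words from noisy alt-text and keep the top few."""
    words = [w.lower().strip(".,") for w in alt_text.split()]
    candidates = {w for w in words if w and w not in STOPWORDS}
    return sorted(candidates, key=score, reverse=True)[:keep]

def generate_caption(skeleton: List[str]) -> str:
    """Stage 2 (stub): a real system conditions a trained decoder on image
    features plus the skeleton; only the interface is shown here."""
    return "A photo of " + ", ".join(skeleton) + "."

if __name__ == "__main__":
    noisy_alt_text = "IMG_2041.jpg official website photo of a dog running on a sandy beach"
    # Placeholder scores; in practice these would come from a model that rates how
    # image-grounded each word is, which is what filters out the alt-text noise.
    toy_scores = {"dog": 0.9, "beach": 0.8, "running": 0.7, "sandy": 0.6}
    skeleton = predict_skeleton(noisy_alt_text, lambda w: toy_scores.get(w, 0.0))
    print(skeleton)                  # ['dog', 'beach', 'running', 'sandy']
    print(generate_caption(skeleton))
```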
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
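A rough illustration, not the paper's implementation, of the kNN-memory idea summarized above: captions of the k most visually similar memory entries are retrieved to serve as extra context for a captioner. The embeddings and memory contents below are random placeholders.

```python
import numpy as np

def knn_captions(query_feat: np.ndarray,
                 memory_feats: np.ndarray,   # (N, D) image embeddings in the memory
                 memory_caps: list,          # N captions aligned with memory_feats
                 k: int = 3) -> list:
    # cosine similarity between the query image and every memory entry
    q = query_feat / np.linalg.norm(query_feat)
    m = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    sims = m @ q
    top = np.argsort(-sims)[:k]
    return [memory_caps[i] for i in top]

# Usage with random placeholder embeddings
rng = np.random.default_rng(0)
mem = rng.normal(size=(5, 8))
caps = [f"caption {i}" for i in range(5)]
print(knn_captions(rng.normal(size=8), mem, caps, k=2))
```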
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training [119.03392147066093]
Recent autoregressive vision-language models have excelled in few-shot text generation tasks but face challenges in alignment tasks.
We introduce the contrastive loss into text generation models, partitioning the language model into dedicated unimodal text processing and adept multimodal data handling components.
To bridge this gap, this work introduces VideoDatasetName, an inaugural interleaved video-text dataset featuring comprehensive captions.
arXiv Detail & Related papers (2024-01-01T18:58:42Z)
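For context, a generic CLIP-style contrastive loss of the kind the summary above refers to; COSMO's exact formulation and its unimodal/multimodal partitioning are not reproduced here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # symmetric loss: match each image to its text and each text to its image
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example with random embeddings
loss = contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
print(loss.item())
```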
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions [11.274127953112574]
We propose an automated approach to augmenting existing captions with visual details using "frozen" vision experts.
Our proposed method, FuseCap, fuses the outputs of such vision experts with the original captions using a large language model.
We release this large-scale dataset of enriched image-caption pairs for the community.
arXiv Detail & Related papers (2023-05-28T13:16:03Z)
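A hypothetical sketch of the fusion setup described above: outputs from frozen vision experts and the original caption are assembled into one prompt for a large language model. The prompt wording and expert fields are invented for illustration, and no actual LLM call is shown.

```python
def build_fusion_prompt(original_caption: str, expert_outputs: dict) -> str:
    """Assemble a fusion prompt from the original caption and vision-expert findings."""
    lines = [
        "Rewrite the caption so it includes the visual details listed below.",
        f"Original caption: {original_caption}",
    ]
    for expert, finding in expert_outputs.items():
        lines.append(f"{expert}: {finding}")
    lines.append("Enriched caption:")
    return "\n".join(lines)

prompt = build_fusion_prompt(
    "a man riding a bike",
    {"object detector": "man, bicycle, helmet, road sign",
     "attribute tagger": "red helmet, mountain bike",
     "OCR": "sign reads 'Trail 7'"},
)
print(prompt)  # this string would be passed to an LLM to produce the fused caption
```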
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
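A minimal sketch of the pixel-text matching step described above: every spatial position of a CLIP-style feature map is scored against each text embedding to form score maps. Shapes are illustrative, and the full DenseCLIP framework (context-aware prompting, decoders) is not reproduced.

```python
import torch
import torch.nn.functional as F

B, D, H, W, K = 2, 512, 14, 14, 20           # batch, embed dim, feature map size, #classes
pixel_feats = F.normalize(torch.randn(B, D, H, W), dim=1)   # per-pixel embeddings
text_feats  = F.normalize(torch.randn(K, D), dim=1)         # per-class text embeddings

# (B, K, H, W) score maps: similarity of every pixel embedding to every text embedding
score_maps = torch.einsum("bdhw,kd->bkhw", pixel_feats, text_feats)
print(score_maps.shape)   # torch.Size([2, 20, 14, 14]); used to guide a dense-prediction head
```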
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
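A small, hypothetical sketch of the conditioning scheme summarized above: a style token and retrieved keywords are simply prepended to the decoder input. The token naming and the retrieval step itself are assumptions, not the paper's exact design.

```python
def build_decoder_input(style: str, keywords: list, bos: str = "<bos>") -> list:
    """Prepend a style token and keyword tokens to the decoder input sequence."""
    style_token = f"<style:{style}>"            # e.g. curated "COCO-like" vs. noisy web style
    return [style_token] + [f"<kw:{k}>" for k in keywords] + [bos]

print(build_decoder_input("coco", ["dog", "frisbee", "park"]))
# ['<style:coco>', '<kw:dog>', '<kw:frisbee>', '<kw:park>', '<bos>']
```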
- Length-Controllable Image Captioning [67.2079793803317]
We propose a simple length-level embedding to give image captioning models explicit control over caption length.
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We further devise a non-autoregressive image captioning approach that can generate captions in a length-irrelevant complexity.
arXiv Detail & Related papers (2020-07-19T03:40:51Z)
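A minimal sketch, under assumed bucket counts and dimensions, of the length-level embedding mentioned above: each caption's length bucket gets its own embedding, added to every token embedding so the decoder can be steered toward a target length at inference time.

```python
import torch
import torch.nn as nn

class LengthConditionedEmbedding(nn.Module):
    def __init__(self, vocab_size=10000, dim=512, num_levels=4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)     # word embeddings
        self.level = nn.Embedding(num_levels, dim)   # one embedding per length bucket

    def forward(self, token_ids: torch.Tensor, length_level: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T), length_level: (B,) integer bucket per caption
        return self.tok(token_ids) + self.level(length_level).unsqueeze(1)

emb = LengthConditionedEmbedding()
x = emb(torch.randint(0, 10000, (2, 12)), torch.tensor([0, 3]))
print(x.shape)  # torch.Size([2, 12, 512])
```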
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
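A rough sketch of the multi-task objective mentioned above: the decoder's word predictions and an object/predicate tag sequence are trained jointly by summing two cross-entropy losses. The loss weighting and vocabulary sizes are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def multitask_loss(word_logits, word_targets, tag_logits, tag_targets, tag_weight=0.5):
    # word_logits: (B, T, V_word), tag_logits: (B, T, V_tag); targets are (B, T) ids
    word_loss = F.cross_entropy(word_logits.flatten(0, 1), word_targets.flatten())
    tag_loss = F.cross_entropy(tag_logits.flatten(0, 1), tag_targets.flatten())
    return word_loss + tag_weight * tag_loss

loss = multitask_loss(torch.randn(2, 8, 100), torch.randint(0, 100, (2, 8)),
                      torch.randn(2, 8, 30), torch.randint(0, 30, (2, 8)))
print(loss.item())
```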
- Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage [23.71195344840051]
Cross-modal language generation tasks such as image captioning are directly hurt by the trend of data-hungry models combined with the lack of non-English annotations.
We describe an approach called Pivot-Language Generation Stabilization (PLuGS), which leverages directly at training time both existing English annotations and their machine-translated versions.
We show that PLuGS models outperform other candidate solutions in evaluations performed over 5 different target languages.
arXiv Detail & Related papers (2020-05-01T06:58:18Z)
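A speculative sketch of the data construction implied above: each image is paired with both its English caption and a machine-translated version, so training sees the English pivot alongside the target language. The joint target format shown is a guess, not the PLuGS paper's exact scheme.

```python
def make_plugs_example(image_id: str, en_caption: str, mt_caption: str, lang: str) -> dict:
    """Build one training example combining the English pivot and its MT translation."""
    return {
        "image_id": image_id,
        # hypothetical joint target: English pivot, separator, language tag, target caption
        "target": f"{en_caption} <sep> <2{lang}> {mt_caption}",
    }

print(make_plugs_example("img_001", "a dog on the beach", "un chien sur la plage", "fr"))
```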
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.