Alleviating Noisy Data in Image Captioning with Cooperative Distillation
- URL: http://arxiv.org/abs/2012.11691v1
- Date: Mon, 21 Dec 2020 21:32:28 GMT
- Title: Alleviating Noisy Data in Image Captioning with Cooperative Distillation
- Authors: Pierre Dognin, Igor Melnyk, Youssef Mroueh, Inkit Padhi, Mattia
Rigotti, Jarret Ross, Yair Schiff
- Abstract summary: We propose a new technique that combines clean curated datasets with the web-scale automatically extracted captions of the Google Conceptual Captions dataset (GCC).
GCC captions can be poor descriptions of their images, but the dataset is abundant in size and therefore provides a rich vocabulary, resulting in more expressive captions.
- Score: 27.623398746616026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning systems have made substantial progress, largely due to the
availability of curated datasets like Microsoft COCO or VizWiz that have
accurate descriptions of their corresponding images. Unfortunately, the scarce
availability of such cleanly labeled data results in trained algorithms
producing captions that can be terse and idiosyncratically specific to details
in the image. We propose a new technique, cooperative distillation, that
combines clean curated datasets with the web-scale automatically extracted
captions of the Google Conceptual Captions dataset (GCC), which can have poor
descriptions of images, but is abundant in size and therefore provides a rich
vocabulary, resulting in more expressive captions.
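The abstract describes the data mixture but not the exact training objective, so the following is only a minimal, hypothetical sketch of one way "cooperative distillation" could look: two peer captioners trained with ordinary cross-entropy on the clean (COCO-style) split while exchanging softened predictions on the noisy (GCC-style) split through a KL term. All module names, toy sizes, and the loss weighting below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: mix clean (COCO-style) and noisy (GCC-style) caption data,
# trusting ground-truth tokens on the clean split and a peer model's soft targets
# on the noisy split. Illustrative only; not the paper's actual objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, FEAT, HID = 1000, 512, 256  # toy sizes, not from the paper

class TinyCaptioner(nn.Module):
    """Toy image-conditioned next-token predictor (stand-in for a real captioner)."""
    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(FEAT, HID)
        self.tok_emb = nn.Embedding(VOCAB, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.head = nn.Linear(HID, VOCAB)

    def forward(self, img_feats, tokens):
        h0 = self.img_proj(img_feats).unsqueeze(0)   # (1, B, HID) initial state from the image
        out, _ = self.rnn(self.tok_emb(tokens), h0)  # (B, T, HID)
        return self.head(out)                        # (B, T, VOCAB) next-token logits

def cooperative_step(model_a, model_b, clean, noisy, tau=2.0, lam=0.5):
    """One training step for model_a; model_b supplies soft targets on noisy data."""
    img_c, cap_c = clean   # clean pair: supervise with ground-truth tokens
    img_n, cap_n = noisy   # noisy pair: lean on the peer's distribution instead
    logits_c = model_a(img_c, cap_c[:, :-1])
    ce = F.cross_entropy(logits_c.reshape(-1, VOCAB), cap_c[:, 1:].reshape(-1))

    logits_n = model_a(img_n, cap_n[:, :-1])
    with torch.no_grad():
        teacher = model_b(img_n, cap_n[:, :-1])
    kl = F.kl_div(F.log_softmax(logits_n / tau, dim=-1),
                  F.softmax(teacher / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    return ce + lam * kl

# Toy usage with random tensors standing in for COCO / GCC batches.
a, b = TinyCaptioner(), TinyCaptioner()
opt = torch.optim.Adam(a.parameters(), lr=1e-4)
clean = (torch.randn(4, FEAT), torch.randint(0, VOCAB, (4, 12)))
noisy = (torch.randn(4, FEAT), torch.randint(0, VOCAB, (4, 12)))
opt.zero_grad()
loss = cooperative_step(a, b, clean, noisy)
loss.backward()
opt.step()
```

In this reading, either peer can be used on its own at inference time; the intended effect of the KL term is that noisy web captions contribute vocabulary without their errors being trusted as hard targets.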
Related papers
- COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation [38.09277249986138]
The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks.
COCONut-PanCap supports improved training of vision-language models for image understanding and generative models for text-to-image tasks.
arXiv Detail & Related papers (2025-02-04T18:59:46Z) - Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning [77.2852342808769]
In this paper, we introduce a detailed caption benchmark, termed CompreCap, to evaluate the visual context from a directed scene graph view.
We first manually segment the image into semantically meaningful regions according to a common-object vocabulary, while also distinguishing attributes of objects within all those regions.
Directional relation labels of these objects are then annotated to compose a directed scene graph that encodes rich compositional information of the image.
arXiv Detail & Related papers (2024-12-11T18:37:42Z) - FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings.
Our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z) - Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z) - Retrieval-augmented Image Captioning [15.266569206458648]
We present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore.
The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT.
Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.
arXiv Detail & Related papers (2023-02-16T12:54:13Z) - Semi-Supervised Image Captioning by Adversarially Propagating Labeled
Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We report extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis on scarcely-paired datasets.
arXiv Detail & Related papers (2023-01-26T15:25:43Z) - Noise-aware Learning from Web-crawled Image-Text Data for Image
Captioning [6.101765622702223]
The Noise-aware Captioning (NoC) framework learns rich knowledge from the whole web-crawled dataset while being less affected by the noise.
This is achieved by the proposed alignment-level-controllable captioner, which is learned using alignment levels of the image-text pairs as a control signal (a toy sketch of this control-signal idea follows the list below).
An in-depth analysis shows the effectiveness of our framework in handling noise.
arXiv Detail & Related papers (2022-12-27T17:33:40Z) - CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity.
arXiv Detail & Related papers (2022-04-27T14:40:31Z) - Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting the regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z) - Exploring Semantic Relationships for Unpaired Image Captioning [40.401322131624866]
We achieve unpaired image captioning by bridging the vision and the language domains with high-level semantic information.
We propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image.
The proposed approach boosts five strong baselines under the paired setting, where the most significant improvement in CIDEr score reaches 8%.
arXiv Detail & Related papers (2021-06-20T09:10:11Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed metric maintains robust performance and gives more flexible scores to candidate captions containing semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.