Multi-Modal Image Captioning for the Visually Impaired
- URL: http://arxiv.org/abs/2105.08106v1
- Date: Mon, 17 May 2021 18:35:24 GMT
- Title: Multi-Modal Image Captioning for the Visually Impaired
- Authors: Hiba Ahsan, Nikita Bhalla, Daivat Bhatt, Kaivankumar Shah
- Abstract summary: One of the ways blind people understand their surroundings is by capturing images and relying on descriptions generated by image captioning systems.
Current work on captioning images for the visually impaired does not use the textual data present in the image when generating captions.
In this work, we propose altering AoANet, a state-of-the-art image captioning model, to leverage the text detected in the image as an input feature.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the ways blind people understand their surroundings is by capturing images and relying on descriptions generated by image captioning systems. Current work on captioning images for the visually impaired does not use the textual data present in the image when generating captions. This problem is critical as many visual scenes contain text. Moreover, up to 21% of the questions asked by blind people about the images they capture pertain to the text present in them. In this work, we propose altering AoANet, a state-of-the-art image captioning model, to leverage the text detected in the image as an input feature. In addition, we use a pointer-generator mechanism to copy the detected text to the caption when tokens need to be reproduced accurately. Our model outperforms AoANet on the benchmark dataset VizWiz, giving a 35% and 16.2% performance improvement on CIDEr and SPICE scores, respectively.
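The copy mechanism described in the abstract can be pictured as a small decoder-side head that mixes a vocabulary distribution with an attention distribution over OCR-detected tokens, in the style of pointer-generator networks. Below is a minimal PyTorch sketch of that idea; the class name, tensor shapes, and single-step interface are illustrative assumptions, not the authors' AoANet implementation.

```python
# Minimal sketch (assumed shapes and names, not the paper's code): at each
# decoding step, mix a "generate from vocabulary" distribution with a
# "copy an OCR token" distribution, weighted by a learned gate.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CopyAugmentedDecoderStep(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.vocab_head = nn.Linear(hidden_dim, vocab_size)  # generation distribution
        self.copy_attn = nn.Linear(hidden_dim, hidden_dim)   # scores OCR token features
        self.p_gen = nn.Linear(hidden_dim, 1)                 # gate: generate vs. copy

    def forward(self, dec_state, ocr_feats, ocr_token_ids, vocab_size_ext):
        # dec_state:     (B, H)    decoder hidden state at this time step
        # ocr_feats:     (B, T, H) embeddings of OCR-detected tokens
        # ocr_token_ids: (B, T)    indices of OCR tokens in the extended vocabulary
        gen_dist = F.softmax(self.vocab_head(dec_state), dim=-1)  # (B, V)

        # Attention over OCR tokens gives a copy distribution.
        scores = torch.bmm(ocr_feats, self.copy_attn(dec_state).unsqueeze(-1)).squeeze(-1)
        copy_dist = F.softmax(scores, dim=-1)                     # (B, T)

        # Probability of generating (vs. copying) at this step.
        p = torch.sigmoid(self.p_gen(dec_state))                  # (B, 1)

        # Scatter copy probabilities onto an extended vocabulary so rare OCR
        # strings (prices, brand names, etc.) can be reproduced verbatim.
        out = torch.zeros(dec_state.size(0), vocab_size_ext, device=dec_state.device)
        out[:, :gen_dist.size(1)] = p * gen_dist
        out.scatter_add_(1, ocr_token_ids, (1 - p) * copy_dist)
        return out  # final distribution over base vocabulary + OCR tokens
```

Here `vocab_size_ext` is assumed to be the base vocabulary size plus the number of per-image OCR tokens, as in standard pointer-generator formulations, so out-of-vocabulary text detected in the image remains reachable at decoding time.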
Related papers
- VCR: Visual Caption Restoration [80.24176572093512]
We introduce Visual Caption Restoration (VCR), a vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images.
This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images.
arXiv Detail & Related papers (2024-06-10T16:58:48Z)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- Image Captioners Sometimes Tell More Than Images They See [8.640488282016351]
Image captioning, a.k.a. "image-to-text," generates descriptive text from given images.
We have performed experiments involving the classification of images from descriptive text alone.
We have evaluated several image captioning models with respect to a disaster image classification task, CrisisNLP.
arXiv Detail & Related papers (2023-05-04T15:32:41Z)
- Revising Image-Text Retrieval via Multi-Modal Entailment [25.988058843564335]
The many-to-many matching phenomenon is quite common in widely used image-text retrieval datasets.
We propose a multi-modal entailment classifier to determine whether a sentence is entailed by an image plus its linked captions.
arXiv Detail & Related papers (2022-08-22T07:58:54Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed metric maintains robust performance and assigns more flexible scores to candidate captions that contain semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.