Image Captioners Sometimes Tell More Than Images They See
- URL: http://arxiv.org/abs/2305.02932v2
- Date: Thu, 11 May 2023 03:58:29 GMT
- Title: Image Captioners Sometimes Tell More Than Images They See
- Authors: Honori Udo and Takafumi Koshinaka
- Abstract summary: Image captioning, a.k.a. "image-to-text," generates descriptive text from given images.
We have performed experiments involving the classification of images from descriptive text alone.
We have evaluated several image captioning models with respect to a disaster image classification task, CrisisNLP.
- Score: 8.640488282016351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning, a.k.a. "image-to-text," which generates descriptive text
from given images, has been rapidly developing throughout the era of deep
learning. To what extent is the information in the original image preserved in
the descriptive text generated by an image captioner? To answer that question,
we have performed experiments involving the classification of images from
descriptive text alone, without referring to the images at all, and compared
results with those from standard image-based classifiers. We have evaluated
several image captioning models with respect to a disaster image classification
task, CrisisNLP, and show that descriptive text classifiers can sometimes
achieve higher accuracy than standard image-based classifiers. Further, we show
that fusing an image-based classifier with a descriptive text classifier can
provide an improvement in accuracy.
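To make the setup concrete, here is a minimal sketch of the caption-then-classify pipeline described above, with a simple score-level fusion at the end. The BLIP checkpoint, the TF-IDF + logistic-regression text classifier, the placeholder file names and labels, and the averaging fusion are illustrative assumptions, not the authors' exact models; the paper evaluates several captioners on the CrisisNLP disaster-image task.

```python
import numpy as np
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from transformers import pipeline

# Image -> descriptive text, using an off-the-shelf captioner (assumption: BLIP base).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def caption(path: str) -> str:
    """Generate one descriptive sentence for the image at `path`."""
    return captioner(Image.open(path).convert("RGB"))[0]["generated_text"]

# Placeholder labeled disaster images (hypothetical file names and labels).
train_paths = ["flood_001.jpg", "wildfire_002.jpg"]
train_labels = ["flood", "wildfire"]

# Descriptive text -> class label: this classifier never sees the images themselves.
text_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
text_clf.fit([caption(p) for p in train_paths], train_labels)

def fused_probs(path: str, image_clf_probs):
    """Late fusion: average the text classifier's class posteriors with those of a
    standard image-based classifier (one simple fusion scheme among many).
    Assumes `image_clf_probs` is ordered like `text_clf.classes_`."""
    p_text = text_clf.predict_proba([caption(path)])[0]
    return 0.5 * p_text + 0.5 * np.asarray(image_clf_probs)
```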
Related papers
- ITI-GEN: Inclusive Text-to-Image Generation [56.72212367905351]
This study investigates inclusive text-to-image generative models that generate images based on human-written prompts.
We show that, for some attributes, images can represent concepts more expressively than text.
We propose a novel approach, ITI-GEN, that leverages readily available reference images for Inclusive Text-to-Image GENeration.
arXiv Detail & Related papers (2023-09-11T15:54:30Z) - GIST: Generating Image-Specific Text for Fine-grained Object Classification [8.118079247462425]
GIST is a method for generating image-specific fine-grained text descriptions from image-only datasets.
Our method achieves an average improvement of 4.1% in accuracy over CLIP linear probes.
arXiv Detail & Related papers (2023-07-21T02:47:18Z) - CapText: Large Language Model-based Caption Generation From Image Context and Description [0.0]
We propose and evaluate a new approach to generate captions from textual descriptions and context alone.
Our approach outperforms current state-of-the-art image-text alignment models like OSCAR-VinVL on this task on the CIDEr metric.
arXiv Detail & Related papers (2023-06-01T02:40:44Z) - Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z) - Revising Image-Text Retrieval via Multi-Modal Entailment [25.988058843564335]
The many-to-many matching phenomenon is quite common in widely used image-text retrieval datasets.
We propose a multi-modal entailment classifier to determine whether a sentence is entailed by an image plus its linked captions.
arXiv Detail & Related papers (2022-08-22T07:58:54Z) - From images in the wild to video-informed image classification [0.7804710977378488]
This paper describes experiments applying state-of-the-art object classifiers to a unique set of images in the wild with high visual complexity, collected on the island of Bali.
It describes differences between actual images in the wild and images from ImageNet, and then discusses a novel approach that combines informational cues particular to video with an ensemble of imperfect classifiers to improve classification results on video-sourced images of plants in the wild.
arXiv Detail & Related papers (2021-09-24T15:53:37Z) - Multi-Modal Image Captioning for the Visually Impaired [0.0]
One of the ways blind people understand their surroundings is by capturing images and relying on descriptions generated by image captioning systems.
Current work on captioning images for the visually impaired does not use the textual data present in the image when generating captions.
In this work, we propose altering AoANet, a state-of-the-art image captioning model, to leverage the text detected in the image as an input feature.
arXiv Detail & Related papers (2021-05-17T18:35:24Z) - Telling the What while Pointing the Where: Fine-grained Mouse Trace and Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires the ability to also express where in the image the content the user is looking for is located.
In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
arXiv Detail & Related papers (2021-02-09T17:54:34Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed method maintains robust performance and gives more flexible scores to candidate captions when faced with semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z) - Text-to-Image Generation Grounded by Fine-Grained User Attention [62.94737811887098]
Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces.
We propose TReCS, a sequential model that exploits this grounding to generate images.
arXiv Detail & Related papers (2020-11-07T13:23:31Z) - Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
To evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.