Structural and Functional Decomposition for Personality Image Captioning in a Communication Game
- URL: http://arxiv.org/abs/2011.08543v1
- Date: Tue, 17 Nov 2020 10:19:27 GMT
- Title: Structural and Functional Decomposition for Personality Image Captioning in a Communication Game
- Authors: Thu Nguyen, Duy Phung, Minh Hoai, Thien Huu Nguyen
- Abstract summary: Personality image captioning (PIC) aims to describe an image with a natural language caption given a personality trait.
We introduce a novel formulation for PIC based on a communication game between a speaker and a listener.
- Score: 53.74847926974122
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Personality image captioning (PIC) aims to describe an image with a natural
language caption given a personality trait. In this work, we introduce a novel
formulation for PIC based on a communication game between a speaker and a
listener. The speaker attempts to generate natural language captions while the
listener encourages the generated captions to contain discriminative
information about the input images and personality traits. In this way, we
expect that the generated captions can be improved to naturally represent the
images and express the traits. In addition, we propose to adapt the language
model GPT2 to perform caption generation for PIC. This enables the speaker and
listener to benefit from the language encoding capacity of GPT2. Our
experiments show that the proposed model achieves state-of-the-art
performance for PIC.
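As a concrete illustration of the speaker-listener objective, the sketch below is a minimal, hypothetical PyTorch formulation (the additive image-trait fusion, the in-batch contrastive loss, and names such as listener_loss are assumptions for illustration, not the paper's released code). The listener rewards captions whose embeddings can retrieve the correct image-trait pair within a batch; this signal would be combined with the GPT2 speaker's standard language-modeling loss.

```python
import torch
import torch.nn.functional as F

def listener_loss(caption_emb, image_emb, trait_emb, temperature=0.07):
    # Hypothetical listener: each caption should identify its own
    # (image, trait) pair among all pairs in the batch; matching
    # pairs sit on the diagonal of the similarity matrix.
    target = F.normalize(image_emb + trait_emb, dim=-1)  # additive fusion (assumption)
    caption = F.normalize(caption_emb, dim=-1)
    logits = caption @ target.t() / temperature          # (B, B) cosine similarities
    labels = torch.arange(logits.size(0))                # correct pair index per caption
    return F.cross_entropy(logits, labels)

# Toy usage: random vectors stand in for GPT2 caption states and image features.
B, d = 8, 256
cap = torch.randn(B, d, requires_grad=True)
loss = listener_loss(cap, torch.randn(B, d), torch.randn(B, d))
loss.backward()  # in training, added to the speaker's caption LM loss
```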
Related papers
- Translating speech with just images [23.104041372055466]
Visually grounded speech models link speech to images; we extend this connection by linking images to text via an existing image captioning system.
This approach can be used for speech translation with just images by having the audio in a different language from the generated captions.
We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model.
arXiv Detail & Related papers (2024-06-11T10:29:24Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component based on visual similarities.
We experimentally validate our approach on the COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions (a generic retrieval sketch appears after this list).
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: being informationally sufficient, minimally redundant, and readily comprehensible by humans.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z)
- Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens [87.52235889917223]
We set the output of the proposed Im2Sp model to discretized speech units, i.e., quantized speech features from a self-supervised speech model.
With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performance on two widely used benchmark databases.
arXiv Detail & Related papers (2023-09-15T16:48:34Z)
- Cross-Domain Image Captioning with Discriminative Finetuning [20.585138136033905]
Fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language.
We show that discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task.
arXiv Detail & Related papers (2023-04-04T09:33:16Z)
- Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC that optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
- Text-Free Image-to-Speech Synthesis Using Learned Segmental Units [24.657722909094662]
We present the first model for directly generating fluent, natural-sounding spoken audio captions for images.
We connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units.
We conduct experiments on the Flickr8k spoken caption dataset and a novel corpus of spoken audio captions collected for the popular MSCOCO dataset.
arXiv Detail & Related papers (2020-12-31T05:28:38Z)
- Pragmatic Issue-Sensitive Image Captioning [11.998287522410404]
We propose Issue-Sensitive Image Captioning (ISIC), a captioning system given a target image and an issue, i.e., a set of images partitioned in a way that specifies what information is relevant.
We show how ISIC can complement and enrich the related task of Visual Question Answering.
arXiv Detail & Related papers (2020-04-29T20:00:53Z)
- Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
To evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
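Referring back to the retrieval-augmented captioning entry above, the following is a generic, assumption-laden toy sketch of kNN caption retrieval (the function retrieve_captions, the cosine-similarity scoring, and the toy memory contents are hypothetical, not that paper's implementation). The retrieved captions would be passed to the caption generator as additional context.

```python
import torch
import torch.nn.functional as F

def retrieve_captions(query_feat, memory_feats, memory_captions, k=2):
    # Cosine similarity between the query image feature and every memory
    # entry; return the captions of the k most similar entries.
    sims = F.normalize(query_feat, dim=-1) @ F.normalize(memory_feats, dim=-1).t()
    _, idx = sims.topk(k)
    return [memory_captions[i] for i in idx.tolist()]

# Toy memory of three entries with 4-d visual features (illustrative values).
memory_feats = torch.randn(3, 4)
memory_captions = ["a dog on grass", "a red car", "two people hiking"]
print(retrieve_captions(torch.randn(4), memory_feats, memory_captions))
```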