Transferable Decoding with Visual Entities for Zero-Shot Image
Captioning
- URL: http://arxiv.org/abs/2307.16525v1
- Date: Mon, 31 Jul 2023 09:47:06 GMT
- Title: Transferable Decoding with Visual Entities for Zero-Shot Image
Captioning
- Authors: Junjie Fei, Teng Wang, Jinrui Zhang, Zhenyu He, Chengjie Wang, Feng
Zheng
- Abstract summary: ViECap is a transferable decoding model that generates descriptions in both seen and unseen scenarios.
ViECap incorporates entity-aware hard prompts to guide LLMs' attention toward the visual entities present in the image.
Our experiments demonstrate that ViECap sets a new state of the art in cross-domain (transferable) captioning.
- Score: 45.855652838621936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-to-text generation aims to describe images using natural language.
Recently, zero-shot image captioning based on pre-trained vision-language
models (VLMs) and large language models (LLMs) has made significant progress.
However, we have observed and empirically demonstrated that these methods are
susceptible to modality bias induced by LLMs and tend to generate descriptions
containing objects (entities) that do not actually exist in the image but
frequently appear during training (i.e., object hallucination). In this paper,
we propose ViECap, a transferable decoding model that leverages entity-aware
decoding to generate descriptions in both seen and unseen scenarios. ViECap
incorporates entity-aware hard prompts to guide LLMs' attention toward the
visual entities present in the image, enabling coherent caption generation
across diverse scenes. With entity-aware hard prompts, ViECap is capable of
maintaining performance when transferring from in-domain to out-of-domain
scenarios. Extensive experiments demonstrate that ViECap sets a new state of the
art in cross-domain (transferable) captioning and performs competitively in
in-domain captioning compared to previous VLM-based zero-shot
methods. Our code is available at: https://github.com/FeiElysia/ViECap
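As a rough illustration of the entity-aware hard-prompt idea (not the authors' exact pipeline; see the repository above for that), the sketch below scores a toy entity vocabulary against the image with CLIP and folds the top entities into a textual prefix that a GPT-2 decoder continues. The prompt template and entity vocabulary are assumptions made for illustration only.

```python
# Minimal sketch (not the authors' exact pipeline): retrieve likely entities
# with CLIP and fold them into a hard prompt that GPT-2 continues.
# The prompt template and toy entity vocabulary are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")

ENTITY_VOCAB = ["dog", "cat", "person", "car", "bicycle", "pizza"]  # toy vocabulary

def entity_aware_caption(image: Image.Image, top_k: int = 2) -> str:
    # Score every candidate entity against the image with CLIP.
    inputs = clip_proc(text=[f"a photo of a {e}" for e in ENTITY_VOCAB],
                       images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
    entities = [ENTITY_VOCAB[i] for i in probs.topk(top_k).indices.tolist()]

    # Entity-aware hard prompt: steer the decoder toward the detected entities.
    # A trained captioning decoder would be used in practice; plain GPT-2 here.
    prompt = f"There are {', '.join(entities)} in the image. A photo of"
    ids = gpt2_tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = gpt2.generate(ids, max_new_tokens=20, do_sample=False,
                            pad_token_id=gpt2_tok.eos_token_id)
    return gpt2_tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

# Example: entity_aware_caption(Image.open("example.jpg"))
```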
Related papers
- MeaCap: Memory-Augmented Zero-shot Image Captioning [11.817667500151687]
We propose a novel Memory-Augmented zero-shot image Captioning framework (MeaCap).
MeaCap can generate concept-centered captions with fewer hallucinations and more world-knowledge.
arXiv Detail & Related papers (2024-03-06T14:00:31Z)
- Towards Automatic Satellite Images Captions Generation Using Large Language Models [0.5439020425819]
We propose Automatic Remote Sensing Image Captioning (ARSIC) to automatically collect captions for remote sensing images.
We also present a benchmark model that adapts the pre-trained generative image2text model (GIT) to generate high-quality captions for remote-sensing images.
arXiv Detail & Related papers (2023-10-17T16:45:47Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens carry high-level semantics comparable to a word and support a dynamic sequence length that varies with the image content.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
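The discrete visual tokenization summarized above can be pictured with a generic vector-quantization step: patch features from a vision encoder are snapped to their nearest codebook entries, giving a sequence of "visual word" ids. This is only a minimal sketch of the general idea with made-up dimensions; LaVIT's actual tokenizer (including its dynamic sequence length) is more involved.

```python
# Minimal sketch of discrete visual tokenization via vector quantization.
# Illustrates the general idea only; dimensions and codebook are made up.
import torch

def quantize_patches(patch_feats: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each patch feature to the index of its nearest codebook vector.

    patch_feats: (num_patches, dim) continuous features from a vision encoder.
    codebook:    (vocab_size, dim) learnable "visual word" embeddings.
    Returns a (num_patches,) tensor of discrete visual token ids.
    """
    dists = torch.cdist(patch_feats, codebook)   # (num_patches, vocab_size) Euclidean distances
    return dists.argmin(dim=-1)                  # nearest code per patch

# Toy usage: 196 ViT patches of dimension 768, a 1024-entry visual vocabulary.
patches = torch.randn(196, 768)
codebook = torch.randn(1024, 768)
visual_tokens = quantize_patches(patches, codebook)
print(visual_tokens.shape)  # torch.Size([196])
```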
- DeViL: Decoding Vision features into Language [53.88202366696955]
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks.
In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned.
We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language.
arXiv Detail & Related papers (2023-09-04T13:59:55Z)
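The feature-to-prompt translation described in the DeViL summary can be sketched in a prefix-tuning style: a small learned module maps a vision feature to a few soft prompt vectors prepended to the language model's token embeddings, and the caption likelihood supervises that module. The linear mapper, prefix length, and dimensions below are illustrative assumptions, not DeViL's actual architecture.

```python
# Illustrative prefix-tuning-style sketch: project a vision feature into a few
# soft prompt embeddings and let GPT-2 score a caption conditioned on them.
# The mapper, prefix length, and dimensions are assumptions, not DeViL's design.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

PREFIX_LEN, FEAT_DIM, EMB_DIM = 4, 2048, gpt2.config.n_embd
mapper = nn.Linear(FEAT_DIM, PREFIX_LEN * EMB_DIM)  # vision feature -> soft prompts

def caption_loss(vision_feat: torch.Tensor, caption: str) -> torch.Tensor:
    """Language-modeling loss of a caption conditioned on a vision feature."""
    prefix = mapper(vision_feat).view(1, PREFIX_LEN, EMB_DIM)
    ids = tok(caption, return_tensors="pt").input_ids          # (1, T)
    tok_emb = gpt2.transformer.wte(ids)                        # (1, T, EMB_DIM)
    inputs_embeds = torch.cat([prefix, tok_emb], dim=1)
    # Ignore the prefix positions in the loss; supervise only caption tokens.
    labels = torch.cat([torch.full((1, PREFIX_LEN), -100, dtype=torch.long), ids], dim=1)
    return gpt2(inputs_embeds=inputs_embeds, labels=labels).loss

loss = caption_loss(torch.randn(FEAT_DIM), "a dog running on the beach")
loss.backward()  # gradients flow into the mapper (and GPT-2, unless it is frozen)
```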
- VicTR: Video-conditioned Text Representations for Activity Recognition [73.09929391614266]
We argue that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information.
We introduce Video-conditioned Text Representations (VicTR), a form of text embeddings optimized w.r.t. visual embeddings.
Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text.
arXiv Detail & Related papers (2023-04-05T16:30:36Z)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z)
- Learning Visual Representations with Caption Annotations [19.24013129952071]
We propose a proxy task to learn visual representations over image-caption pairs.
ICMLM (image-conditioned masked language modeling) predicts masked words in captions by relying on visual cues.
Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations.
arXiv Detail & Related papers (2020-08-04T08:04:16Z)
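The masked-word objective above can be sketched with a toy module: an image feature is projected into the text embedding space and prepended to the caption tokens, so the prediction at a masked position can attend to the visual cue. The vocabulary, dimensions, and encoder below are illustrative, not the paper's setup.

```python
# Toy sketch of image-conditioned masked language modeling: predict a masked
# caption word while attending to a projected image feature. The vocabulary,
# dimensions, and encoder below are illustrative, not the paper's setup.
import torch
import torch.nn as nn

VOCAB, DIM, IMG_DIM, MASK_ID = 1000, 256, 2048, 0

class ToyICMLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, DIM)
        self.img_proj = nn.Linear(IMG_DIM, DIM)       # visual cue -> text space
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)             # masked-word classifier

    def forward(self, token_ids, image_feat):
        # Prepend the projected image feature so masked tokens can attend to it.
        img = self.img_proj(image_feat).unsqueeze(1)            # (B, 1, DIM)
        seq = torch.cat([img, self.tok_emb(token_ids)], dim=1)  # (B, 1+T, DIM)
        hidden = self.encoder(seq)[:, 1:]                       # drop image slot
        return self.head(hidden)                                # (B, T, VOCAB)

model = ToyICMLM()
tokens = torch.randint(1, VOCAB, (2, 8))
tokens[:, 3] = MASK_ID                        # mask one caption position
logits = model(tokens, torch.randn(2, IMG_DIM))
# Targets would be the original words at the masked position (random here).
loss = nn.functional.cross_entropy(logits[:, 3], torch.randint(1, VOCAB, (2,)))
```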
- Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
In order to evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
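The paper's exact Semantic Fidelity definition is not reproduced here; as a generic illustration of an object-based fidelity score, the snippet below measures what fraction of the objects mentioned in a caption are actually among the objects detected in the image.

```python
# Generic illustration of an object-based fidelity score: what fraction of the
# objects mentioned in a caption are actually detected in the image? This is
# NOT the paper's exact Semantic Fidelity formula, just the underlying idea.
def object_fidelity(caption_objects: set[str], detected_objects: set[str]) -> float:
    """Fraction of caption-mentioned objects that appear among detections."""
    if not caption_objects:
        return 1.0  # a caption naming no objects cannot hallucinate any
    return len(caption_objects & detected_objects) / len(caption_objects)

# Toy usage: detector finds a dog and a frisbee; the caption also mentions a ball.
score = object_fidelity({"dog", "frisbee", "ball"}, {"dog", "frisbee", "person"})
print(score)  # 0.666... -> "ball" is likely hallucinated
```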