MeaCap: Memory-Augmented Zero-shot Image Captioning
- URL: http://arxiv.org/abs/2403.03715v1
- Date: Wed, 6 Mar 2024 14:00:31 GMT
- Title: MeaCap: Memory-Augmented Zero-shot Image Captioning
- Authors: Zequn Zeng, Yan Xie, Hao Zhang, Chiyu Chen, Zhengjue Wang, Bo Chen
- Abstract summary: We propose a novel Memory-Augmented zero-shot image Captioning framework (MeaCap).
MeaCap can generate concept-centered captions with fewer hallucinations and more world knowledge.
- Score: 11.817667500151687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot image captioning (IC) without well-paired image-text data can be
divided into two categories, training-free and text-only-training. Generally,
these two types of methods realize zero-shot IC by integrating pretrained
vision-language models like CLIP for image-text similarity evaluation and a
pre-trained language model (LM) for caption generation. The main difference
between them is whether a textual corpus is used to train the LM. Though
achieving attractive performance on some metrics, existing methods often
exhibit common drawbacks: training-free methods tend to produce
hallucinations, while text-only-training methods often lose generalization capability.
To move forward, in this paper, we propose a novel Memory-Augmented zero-shot
image Captioning framework (MeaCap). Specifically, equipped with a textual
memory, we introduce a retrieve-then-filter module to get key concepts that are
highly related to the image. By deploying our proposed memory-augmented
visual-related fusion score in a keywords-to-sentence LM, MeaCap can generate
concept-centered captions that remain highly consistent with the image, with fewer
hallucinations and more world knowledge. MeaCap achieves
state-of-the-art performance on a series of zero-shot IC settings. Our code is
available at https://github.com/joeyz0z/MeaCap.
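Below is a minimal, hedged sketch of the retrieve-then-filter and fusion-scoring ideas described in the abstract, assuming CLIP-style unit-norm embeddings. The embed_text/embed_image placeholders, the word-level concept extraction, and the score weights are illustrative stand-ins, not the released MeaCap implementation (see the GitHub link above for that).

```python
import torch
import torch.nn.functional as F

D = 512  # CLIP-style embedding width

def embed_text(text):
    """Stand-in for a CLIP text encoder: deterministic random unit vector per string."""
    g = torch.Generator().manual_seed(abs(hash(text)) % (2**31))
    return F.normalize(torch.randn(1, D, generator=g), dim=-1)

def embed_image(path):
    """Stand-in for a CLIP image encoder."""
    return F.normalize(torch.randn(1, D), dim=-1)

def retrieve_then_filter(image_emb, memory_texts, memory_embs, k=3, n_concepts=4):
    """Retrieve the k memory captions closest to the image, then keep the
    candidate concepts (here simply unique words) that score highest against it."""
    sims = F.cosine_similarity(image_emb, memory_embs, dim=-1)          # (M,)
    top = sims.topk(min(k, len(memory_texts))).indices.tolist()
    candidates = {w for i in top for w in memory_texts[i].lower().split()}
    ranked = sorted(
        candidates,
        key=lambda w: F.cosine_similarity(image_emb, embed_text(w), dim=-1).item(),
        reverse=True,
    )
    return ranked[:n_concepts]

def fusion_score(sentence, image_emb, memory_embs, alpha=0.8, beta=0.2):
    """Memory-augmented, visual-related score for a candidate caption:
    a weighted mix of image similarity and similarity to the retrieved memory."""
    s_emb = embed_text(sentence)
    visual = F.cosine_similarity(image_emb, s_emb, dim=-1).item()
    memory = F.cosine_similarity(s_emb, memory_embs, dim=-1).max().item()
    return alpha * visual + beta * memory

memory_texts = ["a dog runs on the beach",
                "a man rides a surfboard on a wave",
                "two cats sleep on a sofa"]
memory_embs = torch.cat([embed_text(t) for t in memory_texts])
image_emb = embed_image("photo.jpg")

concepts = retrieve_then_filter(image_emb, memory_texts, memory_embs)
# A keywords-to-sentence LM would generate candidate captions containing these
# concepts; fusion_score(...) (plus LM fluency) would then pick the best one.
print(concepts, fusion_score("a dog runs on the beach", image_emb, memory_embs))
```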
Related papers
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
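A hedged sketch of one plausible form of sub-region feature aggregation: similarity-weighted pooling of local patch embeddings against a query embedding. The grid size, temperature, and aggregate_subregions helper are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def aggregate_subregions(patch_embs, query_emb, tau=0.1):
    """Pool patch/sub-region embeddings, weighting each by its similarity to the query."""
    sims = F.cosine_similarity(query_emb, patch_embs, dim=-1)   # (P,)
    weights = F.softmax(sims / tau, dim=-1)                     # attention over sub-regions
    return weights @ patch_embs                                 # (D,) aggregated local feature

patch_embs = F.normalize(torch.randn(49, 512), dim=-1)  # e.g. a 7x7 grid of patch features (stand-in)
query_emb = F.normalize(torch.randn(1, 512), dim=-1)    # global image or text embedding (stand-in)
local_feat = aggregate_subregions(patch_embs, query_emb)
```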
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- Transferable Decoding with Visual Entities for Zero-Shot Image Captioning [45.855652838621936]
ViECap is a transferable decoding model that generates descriptions in both seen and unseen scenarios.
ViECap incorporates entity-aware hard prompts to guide LLMs' attention toward the visual entities present in the image.
Our experiments demonstrate that ViECap sets a new state of the art in cross-domain (transferable) captioning.
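A minimal sketch of an entity-aware hard prompt in the spirit of ViECap: entities from a toy vocabulary are selected by CLIP-style similarity to the image and spliced into a prompt for the language model. The vocabulary, prompt template, and select_entities helper are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

VOCAB = ["dog", "cat", "surfboard", "pizza", "bicycle"]  # toy entity vocabulary

def select_entities(image_emb, vocab_embs, k=2):
    """Pick the k vocabulary entities whose text embeddings best match the image."""
    sims = F.cosine_similarity(image_emb, vocab_embs, dim=-1)
    return [VOCAB[i] for i in sims.topk(k).indices.tolist()]

def build_hard_prompt(entities):
    """Compose a hard prompt that steers the LM toward the detected entities."""
    return f"There are {', '.join(entities)} in the image. A short description:"

# image_emb and vocab_embs would come from a CLIP image/text encoder;
# random unit vectors stand in for them here.
image_emb = F.normalize(torch.randn(1, 512), dim=-1)
vocab_embs = F.normalize(torch.randn(len(VOCAB), 512), dim=-1)
print(build_hard_prompt(select_entities(image_emb, vocab_embs)))
```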
arXiv Detail & Related papers (2023-07-31T09:47:06Z)
- UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding [84.83494254263138]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z)
- CoBIT: A Contrastive Bi-directional Image-Text Generation Model [72.1700346308106]
CoBIT employs a novel unicoder-decoder structure, which attempts to unify three pre-training objectives in one framework.
CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios.
arXiv Detail & Related papers (2023-03-23T17:24:31Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input.
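A hedged sketch of the projection step: the CLIP image embedding is mapped into the text embedding space as a similarity-weighted mixture over a support memory of caption embeddings. The memory size, temperature tau, and project_to_text_space helper are illustrative assumptions rather than DeCap's exact code.

```python
import torch
import torch.nn.functional as F

def project_to_text_space(image_emb, support_text_embs, tau=0.01):
    """Weight each memory text embedding by its softmaxed similarity to the image."""
    sims = image_emb @ support_text_embs.T            # (1, M) cosine sims (inputs are unit-norm)
    weights = F.softmax(sims / tau, dim=-1)
    projected = weights @ support_text_embs           # (1, D), now lying in the text embedding space
    return F.normalize(projected, dim=-1)

image_emb = F.normalize(torch.randn(1, 512), dim=-1)             # stand-in for a CLIP image feature
support_text_embs = F.normalize(torch.randn(1000, 512), dim=-1)  # stand-in for the caption memory
text_like = project_to_text_space(image_emb, support_text_embs)
# text_like would then condition the lightweight text decoder trained on text only.
```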
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
- Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which heuristically optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on a huge set of image-text pairs from the web, to compute multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy of the CLIP text encoder to improve grammar that does not require extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
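A minimal sketch of CLIP similarity used as a sequence-level reward in a REINFORCE/self-critical-style update, assuming unit-norm CLIP features. The placeholder embeddings, log-probabilities, and reinforce_loss helper are illustrative, not the paper's training loop.

```python
import torch
import torch.nn.functional as F

def clip_reward(image_emb, caption_emb):
    """Reward a caption by its CLIP similarity to the image."""
    return F.cosine_similarity(image_emb, caption_emb, dim=-1)

def reinforce_loss(log_prob_sum, reward, baseline_reward):
    """Policy-gradient loss: push up captions whose reward beats the baseline."""
    advantage = reward - baseline_reward
    return -(advantage.detach() * log_prob_sum).mean()

image_emb   = F.normalize(torch.randn(4, 512), dim=-1)   # batch of CLIP image features (stand-in)
sampled_emb = F.normalize(torch.randn(4, 512), dim=-1)   # CLIP features of sampled captions (stand-in)
greedy_emb  = F.normalize(torch.randn(4, 512), dim=-1)   # CLIP features of greedy captions (baseline)
log_probs   = torch.randn(4, requires_grad=True)         # summed token log-probs of sampled captions

loss = reinforce_loss(log_probs,
                      clip_reward(image_emb, sampled_emb),
                      clip_reward(image_emb, greedy_emb))
loss.backward()
```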
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- ClipCap: CLIP Prefix for Image Captioning [6.69087470775851]
We use the CLIP encoding as a prefix to the caption via a simple mapping network, and then fine-tune a language model to generate the image captions.
We demonstrate our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets.
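A hedged sketch of a ClipCap-style mapping network: a small MLP turns one CLIP image embedding into a sequence of prefix token embeddings that the language model consumes before the caption tokens. The dimensions, prefix length, and PrefixMapper class are illustrative assumptions (the paper also explores a transformer mapper).

```python
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    """Map one CLIP image embedding to prefix_len pseudo-token embeddings for the LM."""
    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, lm_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len // 2, lm_dim * prefix_len),
        )

    def forward(self, clip_emb):                                # (B, clip_dim)
        prefix = self.mlp(clip_emb)                             # (B, lm_dim * prefix_len)
        return prefix.view(-1, self.prefix_len, self.lm_dim)    # (B, prefix_len, lm_dim)

mapper = PrefixMapper()
prefix_embeds = mapper(torch.randn(2, 512))
# prefix_embeds would be concatenated with the caption token embeddings and fed
# to the language model via inputs_embeds during training.
print(prefix_embeds.shape)  # torch.Size([2, 10, 768])
```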
arXiv Detail & Related papers (2021-11-18T14:49:15Z)