MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
- URL: http://arxiv.org/abs/2308.13218v1
- Date: Fri, 25 Aug 2023 07:32:34 GMT
- Title: MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
- Authors: Bang Yang, Fenglin Liu, Xian Wu, Yaowei Wang, Xu Sun, and Yuexian Zou
- Abstract summary: MultiCapCLIP can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets.
Compared with state-of-the-art zero-shot and weakly-supervised methods, our method achieves 4.8% and 21.5% absolute improvements in the BLEU@4 and CIDEr metrics, respectively.
- Score: 108.12011636732674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised visual captioning models typically require a large number of
images or videos paired with descriptions in a specific language (i.e., the
vision-caption pairs) for training. However, collecting and labeling
large-scale datasets is time-consuming and expensive for many scenarios and
languages. Therefore, sufficient labeled pairs are usually not available. To
deal with the label shortage problem, we present a simple yet effective
zero-shot approach MultiCapCLIP that can generate visual captions for different
scenarios and languages without any labeled vision-caption pairs of downstream
datasets. In the training stage, MultiCapCLIP only requires text data for
input. Then it conducts two main steps: 1) retrieving concept prompts that
preserve the corresponding domain knowledge of new scenarios; 2) auto-encoding
the prompts to learn writing styles to output captions in a desired language.
In the testing stage, MultiCapCLIP instead takes visual data as input directly
to retrieve the concept prompts to generate the final visual descriptions. The
extensive experiments on image and video captioning across four benchmarks and
four languages (i.e., English, Chinese, German, and French) confirm the
effectiveness of our approach. Compared with state-of-the-art zero-shot and
weakly-supervised methods, our method achieves 4.8% and 21.5% absolute
improvements in terms of BLEU@4 and CIDEr metrics. Our code is available at
https://github.com/yangbang18/MultiCapCLIP.
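A rough sketch of the concept-prompt retrieval step described above (an illustration under stated assumptions, not the released code at the repository): the random tensors and the concept list stand in for frozen CLIP embeddings and a real concept vocabulary, and the trainable decoder that auto-encodes the retrieved prompts is omitted.
```python
# Illustrative sketch only: random tensors stand in for frozen CLIP embeddings;
# the trainable multilingual decoder that auto-encodes the prompts is omitted.
import torch
import torch.nn.functional as F

def retrieve_concept_prompts(query_emb: torch.Tensor,
                             concept_embs: torch.Tensor,
                             concepts: list[str],
                             k: int = 3) -> list[str]:
    """Pick the k concepts whose CLIP embeddings are most similar to the query.

    Training: the query is a caption's CLIP text embedding (text-only data).
    Testing:  the query is the image/video CLIP embedding, so the same prompts
              can be retrieved without any labeled vision-caption pairs.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), concept_embs, dim=-1)
    topk = sims.topk(min(k, len(concepts))).indices.tolist()
    return [concepts[i] for i in topk]

concepts = ["a dog", "a beach", "a soccer game", "a kitchen"]         # hypothetical vocabulary
concept_embs = F.normalize(torch.randn(len(concepts), 512), dim=-1)   # stand-in for CLIP text embeddings
query_emb = F.normalize(torch.randn(512), dim=-1)                     # stand-in for a CLIP embedding
print(retrieve_concept_prompts(query_emb, concept_embs, concepts))
```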
Related papers
- Retrieval Enhanced Zero-Shot Video Captioning [69.96136689829778]
We bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2.
To connect these frozen models, we propose using learnable tokens as a communication medium between the frozen GPT-2 and the frozen XCLIP.
Experiments show 4% to 20% improvements on the main metric, CIDEr, compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- Text Data-Centric Image Captioning with Interactive Prompts [20.48013600818985]
Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data.
This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap.
arXiv Detail & Related papers (2024-03-28T07:43:49Z)
- MeaCap: Memory-Augmented Zero-shot Image Captioning [11.817667500151687]
We propose a novel Memory-Augmented zero-shot image Captioning framework (MeaCap).
MeaCap can generate concept-centered captions with fewer hallucinations and more world knowledge.
arXiv Detail & Related papers (2024-03-06T14:00:31Z)
- Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
arXiv Detail & Related papers (2023-05-31T17:59:04Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, such that the projected embedding retains the information of the visual input.
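One plausible reading of this projection (a hedged sketch, not DeCap's released code) is a similarity-weighted average over a memory of CLIP text embeddings, so the text-only-trained decoder never receives an input outside the text embedding space; the tensors below are stand-ins for frozen CLIP outputs.
```python
# Hedged sketch of projecting an image embedding into the CLIP text embedding space.
import torch
import torch.nn.functional as F

def project_to_text_space(image_emb: torch.Tensor,
                          text_memory: torch.Tensor,
                          temperature: float = 0.01) -> torch.Tensor:
    """Map a (d,) image embedding onto a similarity-weighted mix of (n, d) text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_memory = F.normalize(text_memory, dim=-1)
    weights = F.softmax(text_memory @ image_emb / temperature, dim=0)  # (n,)
    return weights @ text_memory                                       # (d,) in text space

memory = torch.randn(1000, 512)                    # stand-in for stored CLIP text embeddings
projected = project_to_text_space(torch.randn(512), memory)
```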
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [86.4572981982407]
We propose BLIP, a new vision-language framework which transfers flexibly to both vision-language understanding and generation tasks.
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones.
BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
arXiv Detail & Related papers (2022-01-28T12:49:48Z)
- ClipCap: CLIP Prefix for Image Captioning [6.69087470775851]
We use the CLIP encoding as a prefix to the caption, employing a simple mapping network, and then fine-tune a language model to generate the image captions.
We demonstrate our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets.
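A minimal sketch of the mapping-network idea (dimensions and layer choices are assumptions, not the paper's exact configuration): an MLP expands the CLIP image embedding into a fixed-length prefix of language-model-sized embeddings that would be prepended to the caption token embeddings.
```python
# Hedged sketch of a CLIP-embedding-to-prefix mapping network.
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    def __init__(self, clip_dim: int = 512, lm_dim: int = 768, prefix_len: int = 10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, lm_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len // 2, lm_dim * prefix_len),
        )

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_len, lm_dim)
        return self.mlp(clip_emb).view(-1, self.prefix_len, self.lm_dim)

# The prefix would be concatenated with caption token embeddings and fed to a
# language model such as GPT-2 with a standard captioning loss.
prefix = PrefixMapper()(torch.randn(4, 512))   # -> torch.Size([4, 10, 768])
```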
arXiv Detail & Related papers (2021-11-18T14:49:15Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
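The contrastive objective described here is the standard symmetric InfoNCE loss between the two encoders' outputs; the sketch below shows its general form rather than the paper's exact implementation, with random tensors standing in for the encoder outputs.
```python
# General form of the dual-encoder contrastive (symmetric InfoNCE) loss.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pull matched image/text pairs in a batch together and push mismatched pairs apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))                 # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))  # stand-in encoder outputs
```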
arXiv Detail & Related papers (2021-02-11T10:08:12Z)