From Association to Generation: Text-only Captioning by Unsupervised
Cross-modal Mapping
- URL: http://arxiv.org/abs/2304.13273v3
- Date: Mon, 8 May 2023 02:29:28 GMT
- Title: From Association to Generation: Text-only Captioning by Unsupervised
Cross-modal Mapping
- Authors: Junyang Wang and Ming Yan and Yi Zhang and Jitao Sang
- Abstract summary: We propose a zero-shot method from association to generation for image captioning and video captioning.
Knight achieves state-of-the-art performance among zero-shot methods for image captioning and video captioning.
- Score: 20.67415815472257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of Vision-Language Pre-training Models (VLPMs)
represented by CLIP and ALIGN, significant breakthroughs have been achieved for
association-based visual tasks such as image classification and image-text
retrieval by the zero-shot capability of CLIP without fine-tuning. However,
CLIP is difficult to apply to generation-based tasks because it lacks both a
decoder architecture and pre-training tasks designed for generation. Although
previous works have added generation capacity to CLIP through additional
language models, a modality gap remains between the CLIP representations of
different modalities, and CLIP cannot model the offset of this gap, so
concepts fail to transfer across modalities. To solve this problem, we map
images/videos to the language modality and generate captions from the language
modality. In this paper, we propose the K-nearest-neighbor Cross-modality
Mapping (Knight), a zero-shot method from association to generation. With
text-only unsupervised training, Knight achieves state-of-the-art performance
among zero-shot methods for image captioning and video captioning. Our code is
available at https://github.com/junyangwang0410/Knight.
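For readers skimming this summary, the core mechanism can be sketched compactly: an image (or video frame) is projected onto the text side of CLIP's embedding space by retrieving its nearest neighbors from a text corpus, and a caption is then generated from that language-side representation. The snippet below is a minimal, illustrative sketch of that retrieval-and-mapping step using the open-source clip package; the toy corpus, the softmax weighting, and the function names are assumptions made for illustration and are not the authors' exact implementation (see the repository above for that).

```python
# Illustrative sketch of K-nearest-neighbor cross-modal mapping (NOT the
# authors' exact implementation; see https://github.com/junyangwang0410/Knight).
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Toy text corpus standing in for the large text-only training corpus.
corpus = [
    "a dog running on the beach",
    "a plate of pasta with tomato sauce",
    "a man riding a bicycle down a city street",
]

with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(corpus).to(device))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def map_image_to_text_space(image, k=2):
    """Embed an image, then represent it in the language modality as a
    similarity-weighted average of its k nearest corpus texts."""
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ text_feats.T).squeeze(0)   # cosine similarities
        topk = sims.topk(k)                           # k nearest text neighbors
        weights = topk.values.softmax(dim=0)          # assumed weighting scheme
        return (weights.unsqueeze(1) * text_feats[topk.indices]).sum(dim=0)

# The returned vector lives in the text modality; a decoder trained only on
# text embeddings could then generate the caption from it.
```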
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations.
We propose Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules.
CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks.
arXiv Detail & Related papers (2024-10-03T17:56:09Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts; a minimal illustrative sketch of this idea appears after this list.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experimental results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" the real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment [23.072180427273544]
We argue that directly employing CLIP for zero-shot image captioning relies more on the textual modality in context and largely ignores the visual information.
To address this, we propose Cross-modal Language Models (CLMs) to facilitate unsupervised cross-modal learning.
Experiments on MS COCO and Flickr 30K validate the promising performance of the proposed approach in both captioning quality and computational efficiency.
arXiv Detail & Related papers (2022-11-14T11:12:19Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
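As flagged in the CODER entry above, that paper describes an image by its distance structure to a set of neighbor texts. The sketch below is one plausible, minimal reading of that idea using the open-source clip package; the neighbor-text bank, the softmax profiles, and the profile-matching rule are assumptions for illustration and do not reproduce the paper's exact formulation.

```python
# Minimal sketch in the spirit of CODER: describe an image by its CLIP
# similarities to a bank of neighbor texts, then classify by matching that
# profile against per-class profiles (illustrative assumption, not the
# paper's exact method).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical neighbor-text bank; the paper's key step is building a far
# larger and more diverse set of such texts.
neighbor_texts = [
    "a photo of a cat", "a sketch of a cat",
    "a photo of a dog", "a cartoon dog",
    "a photo of a car", "a vintage car",
]
class_names = ["cat", "dog", "car"]

with torch.no_grad():
    bank = model.encode_text(clip.tokenize(neighbor_texts).to(device))
    bank = bank / bank.norm(dim=-1, keepdim=True)
    class_feats = model.encode_text(
        clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device))
    class_feats = class_feats / class_feats.norm(dim=-1, keepdim=True)
    # Each class is re-expressed by its similarity profile over the text bank.
    class_profiles = (class_feats @ bank.T).softmax(dim=-1)

def classify(image):
    """Return the class whose neighbor-similarity profile best matches the image's."""
    with torch.no_grad():
        feat = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        feat = feat / feat.norm(dim=-1, keepdim=True)
        img_profile = (feat @ bank.T).softmax(dim=-1)   # image neighbor profile
        scores = img_profile @ class_profiles.T         # profile similarity
        return class_names[scores.argmax().item()]
```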