From Association to Generation: Text-only Captioning by Unsupervised
Cross-modal Mapping
- URL: http://arxiv.org/abs/2304.13273v3
- Date: Mon, 8 May 2023 02:29:28 GMT
- Title: From Association to Generation: Text-only Captioning by Unsupervised
Cross-modal Mapping
- Authors: Junyang Wang and Ming Yan and Yi Zhang and Jitao Sang
- Abstract summary: We propose a zero-shot method from association to generation for image captioning and video captioning.
Knight State-of-the-Art achieves performance in zero-shot methods for image captioning and video captioning.
- Score: 20.67415815472257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of Vision-Language Pre-training Models (VLPMs)
represented by CLIP and ALIGN, significant breakthroughs have been achieved for
association-based visual tasks such as image classification and image-text
retrieval by the zero-shot capability of CLIP without fine-tuning. However,
CLIP is hard to apply to generation-based tasks. This is due to the lack of
decoder architecture and pre-training tasks for generation. Although previous
works have created generation capacity for CLIP through additional language
models, a modality gap between the CLIP representations of different modalities
and the inability of CLIP to model the offset of this gap, which fails the
concept to transfer across modalities. To solve the problem, we try to map
images/videos to the language modality and generate captions from the language
modality. In this paper, we propose the K-nearest-neighbor Cross-modality
Mapping (Knight), a zero-shot method from association to generation. With
text-only unsupervised training, Knight achieves State-of-the-Art performance
in zero-shot methods for image captioning and video captioning. Our code is
available at https://github.com/junyangwang0410/Knight.
Related papers
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation(CODER) based on the distance structure between images and their neighbor texts.
The key to construct a high-quality CODER lies in how to create a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z) - Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via
Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z) - CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z) - SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense predictions tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z) - CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z) - Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to CLIP model.
Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z) - Zero-shot Image Captioning by Anchor-augmented Vision-Language Space
Alignment [23.072180427273544]
We discuss that directly employing CLIP for zero-shot image captioning relies more on the textual modality in context and largely ignores the visual information.
To address this, we propose Cross-modal Language Models (CLMs) to facilitate unsupervised cross-modal learning.
Experiments on MS COCO and Flickr 30K validate the promising performance of proposed approach in both captioning quality and computational efficiency.
arXiv Detail & Related papers (2022-11-14T11:12:19Z) - No Token Left Behind: Explainability-Aided Image Classification and
Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.