VLDeformer: Learning Visual-Semantic Embeddings by Vision-Language
Transformer Decomposing
- URL: http://arxiv.org/abs/2110.11338v1
- Date: Wed, 20 Oct 2021 09:00:51 GMT
- Title: VLDeformer: Learning Visual-Semantic Embeddings by Vision-Language
Transformer Decomposing
- Authors: Lisai Zhang and Hongfa Wu and Qingcai Chen and Yimeng Deng and
Zhonghua Li and Dejiang Kong and Zhao Cao and Joanna Siebert and Yunpeng Han
- Abstract summary: Vision-language transformers (VL transformers) have shown impressive accuracy in cross-modal retrieval.
We propose a novel Vision-language Transformer Decomposing (VLDeformer) method that converts the VL transformer into an individual encoder for a single image or text.
- Score: 7.890230091463883
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-language transformers (VL transformers) have shown impressive accuracy in cross-modal retrieval. However, most existing VL transformers use an early-interaction dataflow that computes a joint representation for each text-image input. At retrieval time, such models must run inference on all candidate text-image combinations, which incurs high computational cost. The goal of this paper is to decompose the early-interaction dataflow inside the pre-trained VL transformer to achieve acceleration while maintaining its outstanding accuracy. To this end, we propose a novel Vision-language Transformer Decomposing (VLDeformer) method that converts the VL transformer into an individual encoder for a single image or text through contrastive learning, which accelerates retrieval by thousands of times. Meanwhile, we propose to compose bi-modal hard negatives for the contrastive learning objective, which enables the VLDeformer to maintain the outstanding accuracy of the backbone VL transformer. Extensive experiments on the COCO and Flickr30k datasets demonstrate the superior performance of the proposed method. Considering both effectiveness and efficiency, VLDeformer is a superior choice for cross-modal retrieval at a similar pre-training data scale.
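As a rough illustration of the decomposition idea, the sketch below (PyTorch) trains two separate encoders with a symmetric in-batch contrastive loss plus a margin penalty on the hardest in-batch negatives in both directions. The encoder layers, temperature, margin, and the way hard negatives are mined are illustrative assumptions, not the paper's actual architecture, objective weighting, or hard-negative composition.

```python
# Minimal sketch: decomposed encoders + contrastive loss with hard negatives.
# All modules and hyper-parameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposedEncoders(nn.Module):
    """Separate image/text encoders so embeddings can be pre-computed offline."""
    def __init__(self, dim=768):
        super().__init__()
        # Stand-ins for the two halves obtained by decomposing a VL transformer.
        self.image_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.text_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, image_feats, text_feats):
        v = F.normalize(self.image_encoder(image_feats), dim=-1)
        t = F.normalize(self.text_encoder(text_feats), dim=-1)
        return v, t

def contrastive_loss_with_hard_negatives(v, t, temperature=0.05, margin=0.2):
    """Symmetric InfoNCE over in-batch pairs, with a margin penalty on the
    hardest mismatched text per image and hardest mismatched image per text
    (a rough stand-in for composing bi-modal hard negatives)."""
    sims = v @ t.T                                    # (B, B) cosine similarities
    logits = sims / temperature
    labels = torch.arange(v.size(0), device=v.device)
    nce = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
    eye = torch.eye(v.size(0), dtype=torch.bool, device=v.device)
    hard_i2t = sims.masked_fill(eye, float("-inf")).max(dim=1).values  # hardest text per image
    hard_t2i = sims.masked_fill(eye, float("-inf")).max(dim=0).values  # hardest image per text
    pos = sims.diagonal()
    penalty = (F.relu(margin + hard_i2t - pos).mean() +
               F.relu(margin + hard_t2i - pos).mean()) / 2
    return nce + penalty

encoders = DecomposedEncoders()
v, t = encoders(torch.randn(8, 768), torch.randn(8, 768))
print(contrastive_loss_with_hard_negatives(v, t).item())
```

Because each modality is encoded independently, candidate embeddings can be pre-computed and indexed, so retrieval reduces to dot products over stored vectors rather than a joint transformer forward pass per text-image pair, which is where the reported thousands-fold speedup comes from.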
Related papers
- HTR-VT: Handwritten Text Recognition with Vision Transformer [7.997204893256558]
We explore the application of Vision Transformer (ViT) for handwritten text recognition.
Previous transformer-based models required external data or extensive pre-training on large datasets to excel.
We find that incorporating a Convolutional Neural Network (CNN) for feature extraction instead of the original patch embedding, and employing a Sharpness-Aware Minimization (SAM) optimizer, ensures that the model can converge towards flatter minima.
arXiv Detail & Related papers (2024-09-13T06:46:23Z)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- MAGVLT: Masked Generative Vision-and-Language Transformer [15.796199345773879]
We explore a unified generative vision-and-language model that can produce both images and text sequences.
We propose a generative VL transformer based on non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT).
For rigorous training of our MAGVLT with image-text pairs from scratch, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks.
arXiv Detail & Related papers (2023-03-21T21:49:39Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to handling the variation in the optimal number of tokens that each position should attend to; a toy sparse-attention sketch is given after this list.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architecture for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from the 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
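The DynaST entry above mentions a dynamic-attention unit that adapts how many tokens each position attends to. As a toy illustration of the underlying sparse-attention idea only, the sketch below keeps a fixed top_k keys per query; the fixed budget is an assumption made for brevity and does not reproduce the paper's dynamic mechanism.

```python
# Toy top-k sparse attention (PyTorch): each query keeps only its top_k
# highest-scoring keys. The fixed top_k is an illustrative assumption.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=8):
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5      # (B, Lq, Lk)
    kth_best = scores.topk(top_k, dim=-1).values[..., -1:]    # k-th largest score per query
    scores = scores.masked_fill(scores < kth_best, float("-inf"))  # drop the rest
    return F.softmax(scores, dim=-1) @ v                      # attend over the kept tokens

q, k, v = (torch.randn(2, 16, 64) for _ in range(3))
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([2, 16, 64])
```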