Distilled Dual-Encoder Model for Vision-Language Understanding
- URL: http://arxiv.org/abs/2112.08723v1
- Date: Thu, 16 Dec 2021 09:21:18 GMT
- Title: Distilled Dual-Encoder Model for Vision-Language Understanding
- Authors: Zekun Wang, Wenhui Wang, Haichao Zhu, Ming Liu, Bing Qin, Furu Wei
- Abstract summary: We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks.
We show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements.
- Score: 50.42062182895373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a cross-modal attention distillation framework to train a
dual-encoder model for vision-language understanding tasks, such as visual
reasoning and visual question answering. Dual-encoder models have a faster
inference speed than fusion-encoder models and enable the pre-computation of
images and text during inference. However, the shallow interaction module used
in dual-encoder models is insufficient to handle complex vision-language
understanding tasks. In order to learn deep interactions of images and text, we
introduce cross-modal attention distillation, which uses the image-to-text and
text-to-image attention distributions of a fusion-encoder model to guide the
training of our dual-encoder model. In addition, we show that applying the
cross-modal attention distillation for both pre-training and fine-tuning stages
achieves further improvements. Experimental results demonstrate that the
distilled dual-encoder model achieves competitive performance for visual
reasoning, visual entailment and visual question answering tasks while enjoying
a much faster inference speed than fusion-encoder models. Our code and models
will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.
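As a rough illustration of the cross-modal attention distillation described above, the sketch below pushes a dual-encoder student's image-to-text attention toward a fusion-encoder teacher's attention. This is a minimal sketch in PyTorch, not the authors' released code: the single-head formulation, the tensor shapes, the teacher-to-student layer mapping, and the loss weight lambda_kd are illustrative assumptions; see the linked repository for the actual implementation.

```python
# Minimal sketch (not the authors' code) of cross-modal attention distillation:
# the student's image-to-text attention, computed from separately encoded
# features, is matched to the fusion-encoder teacher's attention via KL.
import torch
import torch.nn.functional as F

def image_to_text_attention(img_feats, txt_feats, txt_mask):
    """img_feats: [B, Ni, D], txt_feats: [B, Nt, D], txt_mask: [B, Nt] (1 = real token)."""
    scale = img_feats.size(-1) ** -0.5
    scores = torch.einsum("bid,btd->bit", img_feats, txt_feats) * scale
    scores = scores.masked_fill(txt_mask[:, None, :] == 0, float("-inf"))
    return F.softmax(scores, dim=-1)  # [B, Ni, Nt]

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-8):
    """KL(teacher || student), averaged over the batch."""
    return F.kl_div((student_attn + eps).log(), teacher_attn, reduction="batchmean")

# Usage sketch (teacher_i2t / teacher_t2i come from the fusion-encoder teacher;
# the symmetric text-to-image term is computed analogously):
# kd_loss = attention_distillation_loss(student_i2t, teacher_i2t) + \
#           attention_distillation_loss(student_t2i, teacher_t2i)
# total_loss = task_loss + lambda_kd * kd_loss   # lambda_kd is an assumed weight
```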
Related papers
- Improving Multi-modal Large Language Model through Boosting Vision Capabilities [54.344077285545005]
We focus on improving the visual understanding capability to boost vision-language models.
We propose Arcana, a multimodal language model, which introduces two crucial techniques.
arXiv Detail & Related papers (2024-10-17T16:36:38Z)
- FLIER: Few-shot Language Image Models Embedded with Latent Representations [2.443383032451177]
We propose a Few-shot Language Image model embedded with latent representations (FLIER) for image recognition.
We first generate images and corresponding latent representations via Stable Diffusion with the textual inputs from GPT-3.
With latent representations as "models-understandable pixels", we introduce a flexible convolutional neural network with two convolutional layers to be the latent encoder.
arXiv Detail & Related papers (2024-10-10T06:27:46Z)
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing the translation in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of the parameters, and it significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval [117.15862403330121]
We propose LoopITR, which combines dual encoders and cross encoders in the same network for joint learning.
Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder (a rough sketch of this loop appears after this list).
arXiv Detail & Related papers (2022-03-10T16:41:12Z)
- VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts [46.55920956687346]
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network.
Because of the modeling flexibility of the Mixture-of-Modality-Experts (MoME) Transformer, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks.
We propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs.
arXiv Detail & Related papers (2021-11-03T17:20:36Z)
- Toward Interpretability of Dual-Encoder Models for Dialogue Response Suggestions [18.117115200484708]
We present an attentive dual encoder model that includes an attention mechanism on top of the extracted word-level features from two encoders.
We design a novel regularization loss to minimize the mutual information between unimportant words and desired labels.
Experiments demonstrate the effectiveness of the proposed model in terms of better Recall@1 accuracy and visualized interpretability.
arXiv Detail & Related papers (2020-03-02T21:26:06Z)
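As a rough sketch of the LoopITR-style loop summarized in the entry above, the snippet below has the dual encoder mine hard negatives for the cross encoder and then distills the cross encoder's scores back into the dual encoder. The encoder interfaces (encode_image, encode_text, score), the top-k size, and the temperature are hypothetical assumptions, not the paper's implementation.

```python
# Hedged sketch of a dual/cross-encoder distillation loop in the spirit of the
# LoopITR entry above; interfaces and hyperparameters are illustrative only.
import torch
import torch.nn.functional as F

def loop_step(dual_encoder, cross_encoder, images, texts, k=4, tau=1.0):
    # 1) Dual encoder scores every image-text pair cheaply via dot products.
    img_emb = dual_encoder.encode_image(images)    # [B, D] (assumed interface)
    txt_emb = dual_encoder.encode_text(texts)      # [B, D] (assumed interface)
    scores = img_emb @ txt_emb.t()                 # [B, B]

    # 2) Mine the top-k hardest negative texts for each image.
    B = scores.size(0)
    diag = torch.eye(B, dtype=torch.bool, device=scores.device)
    hard_idx = scores.masked_fill(diag, float("-inf")).topk(k, dim=-1).indices

    # 3) Candidate set = the positive text plus the mined negatives.
    cand_idx = torch.cat(
        [torch.arange(B, device=scores.device)[:, None], hard_idx], dim=-1)

    # 4) The cross encoder rescores the candidates (assumed interface), and its
    #    sharper distribution is distilled back into the dual encoder.
    cross_scores = cross_encoder.score(images, texts, cand_idx)  # [B, 1 + k]
    dual_subset = torch.gather(scores, 1, cand_idx)              # [B, 1 + k]
    return F.kl_div(F.log_softmax(dual_subset / tau, dim=-1),
                    F.softmax(cross_scores.detach() / tau, dim=-1),
                    reduction="batchmean")
```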
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences arising from its use.