Distilled Dual-Encoder Model for Vision-Language Understanding
- URL: http://arxiv.org/abs/2112.08723v1
- Date: Thu, 16 Dec 2021 09:21:18 GMT
- Title: Distilled Dual-Encoder Model for Vision-Language Understanding
- Authors: Zekun Wang, Wenhui Wang, Haichao Zhu, Ming Liu, Bing Qin, Furu Wei
- Abstract summary: We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks.
We show that applying cross-modal attention distillation in both the pre-training and fine-tuning stages yields further improvements.
- Score: 50.42062182895373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a cross-modal attention distillation framework to train a
dual-encoder model for vision-language understanding tasks, such as visual
reasoning and visual question answering. Dual-encoder models have a faster
inference speed than fusion-encoder models and enable the pre-computation of
images and text during inference. However, the shallow interaction module used
in dual-encoder models is insufficient to handle complex vision-language
understanding tasks. In order to learn deep interactions of images and text, we
introduce cross-modal attention distillation, which uses the image-to-text and
text-to-image attention distributions of a fusion-encoder model to guide the
training of our dual-encoder model. In addition, we show that applying
cross-modal attention distillation in both the pre-training and fine-tuning
stages yields further improvements. Experimental results demonstrate that the
distilled dual-encoder model achieves competitive performance for visual
reasoning, visual entailment and visual question answering tasks while enjoying
a much faster inference speed than fusion-encoder models. Our code and models
will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.
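The core idea is concrete enough to sketch. Below is a minimal, hypothetical PyTorch rendering of the cross-modal attention distillation loss: the teacher's image-to-text and text-to-image attention distributions (taken from a fusion-encoder model) supervise attention maps computed from the dual encoder's separately encoded image and text features. The function names, the scaled dot-product form, and the KL-divergence objective are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def student_cross_attention(q, k):
    """Cross-modal attention computed from the dual encoder's
    separately encoded features. q: (B, Lq, D), k: (B, Lk, D)."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return scores.softmax(dim=-1)                      # (B, Lq, Lk)

def attention_distillation_loss(img_feats, txt_feats, i2t_teacher, t2i_teacher):
    """KL divergence from the fusion-encoder teacher's image-to-text and
    text-to-image attention distributions to the dual-encoder student's
    cross-modal attention. Teacher tensors: (B, Li, Lt) and (B, Lt, Li)."""
    i2t_student = student_cross_attention(img_feats, txt_feats)
    t2i_student = student_cross_attention(txt_feats, img_feats)
    kl = lambda teacher, student: F.kl_div(
        (student + 1e-8).log(), teacher, reduction="batchmean")
    return kl(i2t_teacher, i2t_student) + kl(t2i_teacher, t2i_student)
```

Because the student's image and text features are produced independently, encoding can be pre-computed and cached at inference time; only the lightweight interaction is paid per query, which is where the claimed speedup over fusion encoders comes from.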
Related papers
- Decoder-Only LLMs are Better Controllers for Diffusion Models [63.22040456010123]
We propose to enhance text-to-image diffusion models by borrowing the strength of semantic understanding from large language models.
Our adapter module outperforms state-of-the-art models in terms of text-to-image generation quality and reliability.
arXiv Detail & Related papers (2025-02-06T12:17:35Z)
- Multimodal Autoregressive Pre-training of Large Vision Encoders [85.39154488397931]
We present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process.
Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification.
arXiv Detail & Related papers (2024-11-21T18:31:25Z)
- Improving Multi-modal Large Language Model through Boosting Vision Capabilities [54.344077285545005]
We focus on improving visual understanding capabilities to boost vision-language models.
We propose Arcana, a multimodal language model, which introduces two crucial techniques.
arXiv Detail & Related papers (2024-10-17T16:36:38Z)
- FLIER: Few-shot Language Image Models Embedded with Latent Representations [2.443383032451177]
We propose the Few-shot Language Image model Embedded with latent Representations (FLIER) for image recognition.
We first generate images and corresponding latent representations via Stable Diffusion with the textual inputs from GPT-3.
With latent representations as "models-understandable pixels", we introduce a flexible convolutional neural network with two convolutional layers as the latent encoder.
arXiv Detail & Related papers (2024-10-10T06:27:46Z)
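The summary names the latent encoder concretely: two convolutional layers over Stable Diffusion latents. A minimal sketch of such an encoder follows; the channel widths, strides, and output dimension are assumptions for illustration (Stable Diffusion latents are commonly 4 x 64 x 64), not details from the paper.

```python
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Two-convolution encoder over diffusion latents, in the spirit of
    FLIER's description; all sizes here are illustrative assumptions."""
    def __init__(self, in_ch=4, hidden=32, out_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, 2 * hidden, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # 64x64 latents are downsampled twice -> 16x16 feature maps
        self.proj = nn.Linear(2 * hidden * 16 * 16, out_dim)

    def forward(self, z):                  # z: (B, 4, 64, 64)
        return self.proj(self.conv(z).flatten(1))
```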
- Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework comprises a triple-view encoder and a dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z)
- LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval [117.15862403330121]
We propose LoopITR, which combines dual encoders and cross encoders in the same network for joint learning.
Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder.
arXiv Detail & Related papers (2022-03-10T16:41:12Z)
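The loop described above can be sketched directly. In this hypothetical rendering, the dual encoder's in-batch similarities supply top-k hard negatives for the cross encoder, and the cross encoder's scores are distilled back into the dual encoder via KL divergence; `cross_scorer`, the temperature, and `k` are invented placeholders, not LoopITR's actual interface.

```python
import torch
import torch.nn.functional as F

def loopitr_style_step(img_emb, txt_emb, cross_scorer, tau=0.05, k=8):
    """Hypothetical joint step: the dual encoder mines hard negatives,
    and the cross encoder's rescoring is distilled back to it."""
    sim = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).T  # (B, B)
    # mask matched pairs, then take the k hardest negative texts per image
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hard_idx = sim.masked_fill(eye, -1e4).topk(k, dim=-1).indices        # (B, k)
    # assumed interface: scores for each positive pair plus its k negatives
    cross_logits = cross_scorer(img_emb, txt_emb, hard_idx)              # (B, 1+k)
    dual_logits = torch.cat([sim.diag().unsqueeze(1),
                             sim.gather(1, hard_idx)], dim=1) / tau
    return F.kl_div(F.log_softmax(dual_logits, dim=1),
                    F.softmax(cross_logits, dim=1), reduction="batchmean")
```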
- Toward Interpretability of Dual-Encoder Models for Dialogue Response Suggestions [18.117115200484708]
We present an attentive dual encoder model that includes an attention mechanism on top of the extracted word-level features from two encoders.
We design a novel regularization loss to minimize the mutual information between unimportant words and desired labels.
Experiments demonstrate the effectiveness of the proposed model in terms of better Recall@1 accuracy and visualized interpretability.
arXiv Detail & Related papers (2020-03-02T21:26:06Z)
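The attention over word-level features is the part that is easy to make concrete. The sketch below is a generic attentive pooling layer, not the paper's exact architecture; the mutual-information regularizer, which pushes attention weight off unimportant words, is omitted. Returning the weights is what makes the model inspectable, which is the interpretability angle.

```python
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Attention over word-level features from one encoder; the returned
    weights show which words the pooled representation relies on."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, word_feats):                       # (B, L, D)
        weights = self.score(word_feats).softmax(dim=1)  # (B, L, 1)
        pooled = (weights * word_feats).sum(dim=1)       # (B, D)
        return pooled, weights.squeeze(-1)
```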
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.