Distilled Dual-Encoder Model for Vision-Language Understanding
- URL: http://arxiv.org/abs/2112.08723v1
- Date: Thu, 16 Dec 2021 09:21:18 GMT
- Title: Distilled Dual-Encoder Model for Vision-Language Understanding
- Authors: Zekun Wang, Wenhui Wang, Haichao Zhu, Ming Liu, Bing Qin, Furu Wei
- Abstract summary: We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks.
We show that applying cross-modal attention distillation in both the pre-training and fine-tuning stages achieves further improvements.
- Score: 50.42062182895373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a cross-modal attention distillation framework to train a
dual-encoder model for vision-language understanding tasks, such as visual
reasoning and visual question answering. Dual-encoder models have a faster
inference speed than fusion-encoder models and enable the pre-computation of
images and text during inference. However, the shallow interaction module used
in dual-encoder models is insufficient to handle complex vision-language
understanding tasks. In order to learn deep interactions of images and text, we
introduce cross-modal attention distillation, which uses the image-to-text and
text-to-image attention distributions of a fusion-encoder model to guide the
training of our dual-encoder model. In addition, we show that applying
cross-modal attention distillation in both the pre-training and fine-tuning
stages achieves further improvements. Experimental results demonstrate that the
distilled dual-encoder model achieves competitive performance for visual
reasoning, visual entailment and visual question answering tasks while enjoying
a much faster inference speed than fusion-encoder models. Our code and models
will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.
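The digest above describes the mechanism only in prose, so a rough sketch may help. The snippet below is a minimal, hypothetical PyTorch illustration of the core idea: matching the dual-encoder student's image-to-text and text-to-image attention distributions against those of a fusion-encoder teacher with a KL-divergence objective. The function names, tensor shapes, single-head attention, and the specific choice of KL divergence are assumptions made for illustration, not the authors' released implementation (see the repository linked above for that).
```python
import torch
import torch.nn.functional as F

def attention_distribution(queries, keys, key_mask=None):
    """Scaled dot-product attention probabilities from queries to keys."""
    d = queries.size(-1)
    scores = queries @ keys.transpose(-1, -2) / d ** 0.5          # (B, Lq, Lk)
    if key_mask is not None:                                       # bool mask, True = keep
        scores = scores.masked_fill(~key_mask.unsqueeze(1), float("-inf"))
    return scores.softmax(dim=-1)

def cross_modal_attention_distillation_loss(
    student_img,      # (B, Li, D) image features from the dual encoder's vision branch
    student_txt,      # (B, Lt, D) text features from the dual encoder's text branch
    teacher_img2txt,  # (B, Li, Lt) image-to-text attention from the fusion-encoder teacher
    teacher_txt2img,  # (B, Lt, Li) text-to-image attention from the fusion-encoder teacher
    txt_mask=None,
    img_mask=None,
):
    """KL divergence between teacher and student cross-modal attention maps (a sketch)."""
    s_img2txt = attention_distribution(student_img, student_txt, txt_mask)
    s_txt2img = attention_distribution(student_txt, student_img, img_mask)

    def kl(teacher, student):
        return F.kl_div(student.clamp_min(1e-8).log(), teacher, reduction="batchmean")

    return kl(teacher_img2txt, s_img2txt) + kl(teacher_txt2img, s_txt2img)

# Toy usage: random tensors stand in for encoder outputs and teacher attention maps.
B, Li, Lt, D = 2, 49, 16, 768
loss = cross_modal_attention_distillation_loss(
    torch.randn(B, Li, D),
    torch.randn(B, Lt, D),
    torch.randn(B, Li, Lt).softmax(-1),
    torch.randn(B, Lt, Li).softmax(-1),
)
```
In the paper's framework a loss of this kind would be added to the usual pre-training or fine-tuning objective; real teacher attentions would come from the fusion encoder's cross-modal attention heads rather than random tensors.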
Related papers
- Multimodal Autoregressive Pre-training of Large Vision Encoders [85.39154488397931]
We present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process.
Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification.
arXiv Detail & Related papers (2024-11-21T18:31:25Z)
- Improving Multi-modal Large Language Model through Boosting Vision Capabilities [54.344077285545005]
We focus on improving the visual understanding capability of vision-language models.
We propose Arcana, a multimodal language model, which introduces two crucial techniques.
arXiv Detail & Related papers (2024-10-17T16:36:38Z)
- FLIER: Few-shot Language Image Models Embedded with Latent Representations [2.443383032451177]
We propose a Few-shot Language Image model Embedded with latent Representations (FLIER) for image recognition.
We first generate images and corresponding latent representations via Stable Diffusion with the textual inputs from GPT-3.
With latent representations as "model-understandable pixels", we introduce a flexible convolutional neural network with two convolutional layers to serve as the latent encoder.
arXiv Detail & Related papers (2024-10-10T06:27:46Z)
- Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval [117.15862403330121]
We propose LoopITR, which combines dual encoders and cross encoders in the same network for joint learning.
Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder.
arXiv Detail & Related papers (2022-03-10T16:41:12Z)
- Toward Interpretability of Dual-Encoder Models for Dialogue Response Suggestions [18.117115200484708]
We present an attentive dual encoder model that includes an attention mechanism on top of the extracted word-level features from two encoders.
We design a novel regularization loss to minimize the mutual information between unimportant words and desired labels.
Experiments demonstrate the effectiveness of the proposed model in terms of better Recall@1 accuracy and visualized interpretability.
arXiv Detail & Related papers (2020-03-02T21:26:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.