LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text
Retrieval
- URL: http://arxiv.org/abs/2203.05465v1
- Date: Thu, 10 Mar 2022 16:41:12 GMT
- Title: LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text
Retrieval
- Authors: Jie Lei, Xinlei Chen, Ning Zhang, Mengjiao Wang, Mohit Bansal, Tamara
L. Berg, Licheng Yu
- Abstract summary: We propose LoopITR, which combines dual encoders and cross encoders in the same network for joint learning.
Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder.
- Score: 117.15862403330121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dual encoders and cross encoders have been widely used for image-text
retrieval. Between the two, the dual encoder encodes the image and text
independently, followed by a dot product, while the cross encoder takes the
image and text jointly as input and performs dense multi-modal fusion. These two
architectures are typically modeled separately without interaction. In this
work, we propose LoopITR, which combines them in the same network for joint
learning. Specifically, we let the dual encoder provide hard negatives to the
cross encoder, and use the more discriminative cross encoder to distill its
predictions back to the dual encoder. Both steps are efficiently performed
together in the same model. Our work centers on empirical analyses of this
combined architecture, putting the main focus on the design of the distillation
objective. Our experimental results highlight the benefits of training the two
encoders in the same network, and demonstrate that distillation can be quite
effective with just a few hard negative examples. Experiments on two standard
datasets (Flickr30K and COCO) show our approach achieves state-of-the-art dual
encoder performance when compared with approaches using a similar amount of
data.
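As a rough illustration of the loop described in the abstract, here is a minimal sketch of one training step (not the authors' released code); `image_encoder`, `text_encoder`, `cross_encoder`, and the tensor inputs are hypothetical placeholders, and the cross encoder is assumed to return one matching logit per image-text pair. The dual encoder scores the batch and mines hard negatives, the cross encoder rescores those candidates, and its predictions are distilled back into the dual encoder alongside the usual contrastive loss.

```python
import torch
import torch.nn.functional as F

def loopitr_step(image_encoder, text_encoder, cross_encoder,
                 images, texts, num_hard_negatives=3, temperature=0.07):
    # Dual encoder: encode image and text independently, then take a dot product.
    img_emb = F.normalize(image_encoder(images), dim=-1)        # (B, D)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)          # (B, D)
    sim = img_emb @ txt_emb.t() / temperature                   # (B, B) similarity matrix

    targets = torch.arange(sim.size(0), device=sim.device)
    contrastive_loss = (F.cross_entropy(sim, targets) +
                        F.cross_entropy(sim.t(), targets)) / 2

    # Dual encoder provides hard negatives: top-scoring non-matching texts per image.
    neg_sim = sim.detach().clone()
    neg_sim.fill_diagonal_(float('-inf'))
    hard_idx = neg_sim.topk(num_hard_negatives, dim=1).indices      # (B, K)
    cand_idx = torch.cat([targets.unsqueeze(1), hard_idx], dim=1)   # (B, 1+K); column 0 = positive

    # Cross encoder rescores each image against its candidate texts with dense fusion
    # (assumed here to return one logit per pair; the token-level fusion is elided).
    cross_logits = []
    for i in range(sim.size(0)):
        cand_texts = texts[cand_idx[i]]                              # (1+K, ...)
        repeated_img = images[i].unsqueeze(0).expand(cand_texts.size(0), *images.shape[1:])
        cross_logits.append(cross_encoder(repeated_img, cand_texts).reshape(-1))
    cross_logits = torch.stack(cross_logits, dim=0)                  # (B, 1+K)
    cross_loss = F.cross_entropy(cross_logits, torch.zeros_like(targets))

    # Distill the cross encoder's predictions back into the dual encoder.
    dual_logits = torch.gather(sim, 1, cand_idx)                     # (B, 1+K)
    distill_loss = F.kl_div(F.log_softmax(dual_logits, dim=1),
                            F.softmax(cross_logits.detach(), dim=1),
                            reduction='batchmean')

    return contrastive_loss + cross_loss + distill_loss
```

The `.detach()` on the cross-encoder logits keeps the distillation one-directional, reflecting the abstract's description of the more discriminative cross encoder acting as the teacher for the dual encoder.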
Related papers
- How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval? [99.87554379608224]
The cross-modal similarity score distribution of the cross encoder is more concentrated, while that of the dual encoder is nearly normal.
Only the relative order between hard negatives conveys valid knowledge, while the order information between easy negatives has little significance (a rough sketch of this ranking-only distillation idea follows the list below).
We propose a novel Contrastive Partial Ranking Distillation (CPRD) method which implements the objective of mimicking the relative order between hard negative samples with contrastive learning.
arXiv Detail & Related papers (2024-07-10T09:10:01Z) - Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features.
We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps.
We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
arXiv Detail & Related papers (2023-12-15T08:46:43Z) - Triple-View Knowledge Distillation for Semi-Supervised Semantic
Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework includes the triple-view encoder and the dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z) - Cross-stitching Text and Knowledge Graph Encoders for Distantly
Supervised Relation Extraction [30.274065305756057]
Bi-encoder architectures for distantly-supervised relation extraction are designed to make use of the complementary information found in text and knowledge graphs (KGs).
Here, we introduce cross-stitch bi-encoders, which allow full interaction between the text encoder and the KG encoder via a cross-stitch mechanism.
arXiv Detail & Related papers (2022-11-02T19:01:26Z) - Distilled Dual-Encoder Model for Vision-Language Understanding [50.42062182895373]
We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks.
We show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements.
arXiv Detail & Related papers (2021-12-16T09:21:18Z) - Crosslink-Net: Double-branch Encoder Segmentation Network via Fusing
Vertical and Horizontal Convolutions [58.71117402626524]
We present a novel double-branch encoder architecture for medical image segmentation.
Our architecture is inspired by two observations, the first being that the discrimination of features learned via square convolutional kernels needs to be further improved, which motivates utilizing non-square vertical and horizontal convolutional kernels.
The experiments validate the effectiveness of our model on four datasets.
arXiv Detail & Related papers (2021-07-24T02:58:32Z)
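The first related paper above argues that only the relative order among hard negatives carries useful teacher knowledge. As a rough, assumption-laden sketch (not that paper's CPRD implementation), the snippet below distills only the teacher's ranking of hard negatives into the student via a pairwise margin loss; `student_scores`, `teacher_scores`, and `margin` are hypothetical names.

```python
import torch
import torch.nn.functional as F

def partial_ranking_distillation(student_scores, teacher_scores, margin=0.1):
    """student_scores, teacher_scores: (B, K) scores over K hard-negative candidates."""
    # Transfer only the teacher's relative ordering of the hard negatives.
    order = teacher_scores.argsort(dim=1, descending=True)    # (B, K)
    ranked_student = torch.gather(student_scores, 1, order)   # student scores in teacher order

    # Pairwise hinge: each candidate the teacher ranks higher should also
    # score higher for the student, by at least `margin`.
    higher, lower = ranked_student[:, :-1], ranked_student[:, 1:]
    return F.relu(margin - (higher - lower)).mean()
```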
This list is automatically generated from the titles and abstracts of the papers on this site.