Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval
- URL: http://arxiv.org/abs/2308.04343v1
- Date: Tue, 8 Aug 2023 15:43:59 GMT
- Title: Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval
- Authors: Yi Bin, Haoxuan Li, Yahui Xu, Xing Xu, Yang Yang, Heng Tao Shen
- Abstract summary: Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
- Score: 68.61855682218298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing cross-modal retrieval methods employ two-stream encoders with
different architectures for images and texts, e.g., CNN for images and
RNN/Transformer for texts. Such discrepancy in architectures may induce
different semantic distribution spaces and limit the interactions between
images and texts, and further result in inferior alignment between images and
texts. To fill this research gap, inspired by recent advances of Transformers
in vision tasks, we propose to unify the encoder architectures with
Transformers for both modalities. Specifically, we design a cross-modal
retrieval framework purely based on two-stream Transformers, dubbed
Hierarchical Alignment Transformers (HAT), which consists of an image
Transformer, a text Transformer, and a hierarchical alignment module. With such
identical architectures, the encoders could produce representations with more
similar characteristics for images and texts, and make the interactions and
alignments between them much easier. Besides, to leverage the rich semantics,
we devise a hierarchical alignment scheme to explore multi-level
correspondences of different layers between images and texts. To evaluate the
effectiveness of the proposed HAT, we conduct extensive experiments on two
benchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that
HAT outperforms SOTA baselines by a large margin. Specifically, on two key
tasks, i.e., image-to-text and text-to-image retrieval, HAT achieves
7.6% and 16.7% relative improvement in Recall@1 on MSCOCO, and 4.4% and 11.6%
on Flickr30K, respectively. The code is available at
https://github.com/LuminosityX/HAT.
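To make the framework described above more concrete, the following is a minimal PyTorch sketch of the idea: two architecturally identical Transformer encoders whose layer-wise pooled features are compared level by level. The class name TwoStreamHAT, the mean pooling, the per-level cosine similarities, and the InfoNCE-style training objective are illustrative assumptions based only on the abstract, not the authors' released code (see the linked repository for that).

```python
# Minimal sketch (assumption-based) of a two-stream Transformer retrieval model
# with multi-level alignment, loosely following the HAT abstract above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamHAT(nn.Module):
    """Two architecturally identical Transformer encoders plus a simple
    hierarchical alignment: per-layer pooled features are compared level by level."""

    def __init__(self, dim=256, depth=4, heads=4, vocab=30522, region_dim=2048):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            dim, heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.img_layers = nn.ModuleList([make_layer() for _ in range(depth)])
        self.txt_layers = nn.ModuleList([make_layer() for _ in range(depth)])
        self.img_proj = nn.Linear(region_dim, dim)   # image region/patch features -> dim
        self.txt_embed = nn.Embedding(vocab, dim)    # token ids -> dim

    @staticmethod
    def _encode(x, layers):
        # Collect one L2-normalized, mean-pooled feature per Transformer layer.
        levels = []
        for layer in layers:
            x = layer(x)
            levels.append(F.normalize(x.mean(dim=1), dim=-1))
        return levels                                 # depth x (B, dim)

    def forward(self, img_regions, txt_ids):
        img_levels = self._encode(self.img_proj(img_regions), self.img_layers)
        txt_levels = self._encode(self.txt_embed(txt_ids), self.txt_layers)
        # Hierarchical alignment: average the cosine-similarity matrices of all levels.
        sims = [v @ t.t() for v, t in zip(img_levels, txt_levels)]
        return torch.stack(sims).mean(dim=0)          # (B_img, B_txt)


if __name__ == "__main__":
    model = TwoStreamHAT()
    sim = model(torch.randn(8, 36, 2048), torch.randint(0, 30522, (8, 20)))
    # Symmetric InfoNCE-style objective over the in-batch similarity matrix.
    target = torch.arange(8)
    loss = F.cross_entropy(sim / 0.05, target) + F.cross_entropy(sim.t() / 0.05, target)
    print(sim.shape, loss.item())
```

Averaging the per-level similarity matrices is just one way to realize a hierarchical alignment; the paper's alignment module may weight or combine the levels differently.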
Related papers
- SceneComposer: Any-Level Semantic Image Synthesis [80.55876413285587]
We propose a new framework for conditional image synthesis from semantic layouts of any precision levels.
The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level.
We introduce several novel techniques to address the challenges coming with this new setup.
arXiv Detail & Related papers (2022-11-21T18:59:05Z)
- Pure Transformer with Integrated Experts for Scene Text Recognition [11.089203218000854]
Scene text recognition (STR) is the task of reading text in cropped images of natural scenes.
In recent times, the transformer architecture has been widely adopted in STR because of its strong capability in capturing long-term dependencies.
This work proposes the use of a transformer-only model as a simple baseline which outperforms hybrid CNN-transformer models.
arXiv Detail & Related papers (2022-11-09T15:26:59Z)
- Two-stream Hierarchical Similarity Reasoning for Image-text Matching [66.43071159630006]
Previous approaches only consider learning single-stream similarity alignment.
A two-stream architecture is developed to decompose image-text matching into image-to-text level and text-to-image level similarity computation.
A hierarchical similarity reasoning module is proposed to automatically extract context information.
arXiv Detail & Related papers (2022-03-10T12:56:10Z)
- Embedding Arithmetic for Text-driven Image Transformation [48.7704684871689]
Text representations exhibit geometric regularities, such as the famous analogy: queen is to king what woman is to man.
Recent works aiming to bridge this semantic gap embed images and text into a multimodal space.
We introduce the SIMAT dataset to evaluate the task of text-driven image transformation.
arXiv Detail & Related papers (2021-12-06T16:51:50Z)
- L-Verse: Bidirectional Generation Between Image and Text [41.133824156046394]
L-Verse is a novel architecture consisting of a feature-augmented variational autoencoder (AugVAE) and a bidirectional auto-regressive transformer (BiART).
Our AugVAE shows state-of-the-art reconstruction performance on the ImageNet1K validation set, along with robustness to unseen images in the wild.
L-Verse can be directly used for image-to-text or text-to-image generation tasks without any finetuning or extra object detection frameworks.
arXiv Detail & Related papers (2021-11-22T11:48:26Z)
- Unifying Multimodal Transformer for Bi-directional Image and Text Generation [8.547205551848462]
We study the joint learning of image-to-text and text-to-image generations, which are naturally bi-directional tasks.
We propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks.
arXiv Detail & Related papers (2021-10-19T06:01:24Z)
- XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens and allows efficient processing of high-resolution images (a brief sketch follows this list).
arXiv Detail & Related papers (2021-06-17T17:33:35Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning maps the image and text into a common embedding space to learn text-image matching.
Instance-level optimization is used for identity preservation during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
- Image Captioning through Image Transformer [29.91581534937757]
We introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer.
Our model achieves new state-of-the-art performance on both MSCOCO offline and online testing benchmarks.
arXiv Detail & Related papers (2020-04-29T14:30:57Z)
- Transformer Reasoning Network for Image-Text Matching and Retrieval [14.238818604272751]
We consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval.
We introduce the Transformer Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer.
TERN is able to separately reason on the two different modalities and to enforce a final common abstract concept space.
arXiv Detail & Related papers (2020-04-20T09:09:01Z)
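The cross-covariance attention described in the XCiT entry above can be sketched compactly: attention weights are computed between the d feature channels rather than the N tokens, so the cost grows linearly with N. The code below is a simplified, single-head rendition based only on that summary; the full XCiT block also uses multi-head, temperature-scaled attention and local patch interaction layers, and the name XCABlock is illustrative.

```python
# Simplified, single-head cross-covariance attention (XCA) sketched from the
# XCiT summary: the attention map is (d x d) over feature channels, so the
# cost is linear in the number of tokens N. Multi-head splitting and XCiT's
# local patch interaction / feed-forward blocks are omitted here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class XCABlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.temperature = nn.Parameter(torch.ones(1))  # learnable scale, as in XCiT

    def forward(self, x):                        # x: (B, N, d) token features
        q, k, v = self.qkv(x).chunk(3, dim=-1)   # each (B, N, d)
        # L2-normalize each channel's length-N column before the channel product.
        q = F.normalize(q, dim=1)
        k = F.normalize(k, dim=1)
        attn = (k.transpose(1, 2) @ q) * self.temperature   # (B, d, d) channel map
        attn = attn.softmax(dim=-2)              # weights over source channels
        out = v @ attn                           # (B, N, d); no (N x N) map anywhere
        return self.proj(out)


if __name__ == "__main__":
    block = XCABlock(dim=64)
    tokens = torch.randn(2, 196, 64)             # e.g. 14 x 14 image patches
    print(block(tokens).shape)                   # torch.Size([2, 196, 64])
```

The (d x d) channel map replaces the usual (N x N) token map, which is what makes high-resolution inputs tractable in this scheme.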