Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal
Contrastive Training
- URL: http://arxiv.org/abs/2306.08789v1
- Date: Thu, 15 Jun 2023 00:19:13 GMT
- Title: Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal
Contrastive Training
- Authors: Chong Liu, Yuqi Zhang, Hongsong Wang, Weihua Chen, Fan Wang, Yan
Huang, Yi-Dong Shen, and Liang Wang
- Abstract summary: Image-text retrieval is a central problem for understanding the semantic relationship between vision and language.
Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words.
In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
- Score: 33.78990448307792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image-text retrieval is a central problem for understanding the semantic
relationship between vision and language, and serves as the basis for various
visual and language tasks. Most previous works either simply learn
coarse-grained representations of the overall image and text, or elaborately
establish the correspondence between image regions or pixels and text words.
However, the close relations between coarse- and fine-grained representations
for each modality are important for image-text retrieval but almost neglected.
As a result, such previous works inevitably suffer from low retrieval accuracy
or heavy computational cost. In this work, we address image-text retrieval from
a novel perspective by combining coarse- and fine-grained representation
learning into a unified framework. This framework is consistent with human
cognition, as humans simultaneously pay attention to the entire sample and
regional elements to understand the semantic content. To this end, a
Token-Guided Dual Transformer (TGDT) architecture which consists of two
homogeneous branches for image and text modalities, respectively, is proposed
for image-text retrieval. The TGDT incorporates both coarse- and fine-grained
retrieval into a unified framework and leverages the advantages of both
approaches. A novel training objective called Consistent
Multimodal Contrastive (CMC) loss is proposed accordingly to ensure the intra-
and inter-modal semantic consistencies between images and texts in the common
embedding space. Equipped with a two-stage inference method based on the mixed
global and local cross-modal similarity, the proposed method achieves
state-of-the-art retrieval performance with extremely low inference time
compared with representative recent approaches.
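The abstract's two key mechanisms, the CMC loss and the two-stage inference, can be pictured concretely. Below is a minimal PyTorch-style sketch based only on the description above, not on the authors' released code: the temperature, the form of the intra-modal consistency term (modeled here as agreement between the image-image and text-text similarity structures), the weight w_intra, and the token-level re-ranking score (mean of per-token max matches added to the global score) are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Contrastive building block: a[i] should match b[i] (hypothetical helper)."""
    logits = (a @ b.t()) / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def cmc_loss(img_g, txt_g, temperature=0.07, w_intra=0.5):
    """Sketch of a Consistent Multimodal Contrastive (CMC) objective.
    Inter-modal terms pull matched image/text pairs together in the common
    embedding space; the intra-modal term (one plausible reading of
    'intra-modal semantic consistency') asks the image-image and text-text
    similarity structures to agree. The weighting is an assumption."""
    img_g = F.normalize(img_g, dim=-1)   # (B, d) global image embeddings
    txt_g = F.normalize(txt_g, dim=-1)   # (B, d) global text embeddings
    inter = info_nce(img_g, txt_g, temperature) + info_nce(txt_g, img_g, temperature)
    sim_ii = (img_g @ img_g.t()) / temperature
    sim_tt = (txt_g @ txt_g.t()) / temperature
    intra = F.kl_div(F.log_softmax(sim_ii, dim=-1),
                     F.softmax(sim_tt, dim=-1), reduction="batchmean")
    return inter + w_intra * intra

@torch.no_grad()
def two_stage_retrieve(query_g, query_tokens, gallery_g, gallery_tokens, k=20):
    """Stage 1: rank the whole gallery by global (coarse) cosine similarity.
    Stage 2: re-rank the top-k candidates with a mixed global + token-level
    (fine) score. The exact token-level score used here is an assumption."""
    query_g = F.normalize(query_g, dim=-1)       # (d,) global query embedding
    gallery_g = F.normalize(gallery_g, dim=-1)   # (N, d) global gallery embeddings
    coarse = gallery_g @ query_g                 # (N,) global similarities
    topk = coarse.topk(k).indices
    q_tok = F.normalize(query_tokens, dim=-1)    # (Lq, d) query token embeddings
    fine = []
    for i in topk.tolist():
        c_tok = F.normalize(gallery_tokens[i], dim=-1)          # (Li, d)
        sim = q_tok @ c_tok.t()                                 # token-token similarities
        fine.append(sim.max(dim=1).values.mean() + coarse[i])   # mixed local + global
    order = torch.stack(fine).argsort(descending=True)
    return topk[order]
```

In this reading, stage 1 stays cheap (a single matrix product over the gallery), and only the top-k candidates pay the cost of token-level matching, which is consistent with the abstract's claim of low inference time.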
Related papers
- A New Fine-grained Alignment Method for Image-text Matching [4.33417045761714]
The Cross-Modal Prominent Fragments Enhancement Aligning Network achieves improved retrieval accuracy.
In practice, we first design a novel intra-modal fragment-relationship reasoning method.
Our approach outperforms state-of-the-art methods by about 5% to 10% in the rSum metric.
arXiv Detail & Related papers (2023-11-03T18:27:43Z) - Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z) - HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z) - Image-Specific Information Suppression and Implicit Local Alignment for
Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to retrieve pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
arXiv Detail & Related papers (2022-08-30T16:14:18Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid
Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up crOss-modal Semantic compoSition (BOSS) framework with Hybrid Counterfactual Training.
arXiv Detail & Related papers (2022-07-09T07:14:44Z) - Two-stream Hierarchical Similarity Reasoning for Image-text Matching [66.43071159630006]
Previous approaches only consider learning single-stream similarity alignment.
A two-stream architecture is developed to decompose image-text matching into image-to-text and text-to-image similarity computation.
A hierarchical similarity reasoning module is proposed to automatically extract context information.
arXiv Detail & Related papers (2022-03-10T12:56:10Z) - Text-based Person Search in Full Images via Semantic-Driven Proposal
Generation [42.25611020956918]
We propose a new end-to-end learning framework that jointly optimizes the pedestrian detection, identification, and visual-semantic feature embedding tasks.
To take full advantage of the query text, the semantic features are leveraged to instruct the Region Proposal Network to pay more attention to the text-described proposals.
arXiv Detail & Related papers (2021-09-27T11:42:40Z) - Step-Wise Hierarchical Alignment Network for Image-Text Matching [29.07229472373576]
We propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process.
Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and global-to-global alignment at the context level (a generic sketch of this multi-step pattern appears after this list).
arXiv Detail & Related papers (2021-06-11T17:05:56Z) - Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z) - Fine-grained Image Classification and Retrieval by Combining Visual and
Locally Pooled Textual Features [8.317191999275536]
In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks.
In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities.
arXiv Detail & Related papers (2020-01-14T12:06:12Z)
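Several entries above (SHAN, HGAN, the two-stream hierarchical reasoning work) share a coarse-plus-fine alignment recipe. As referenced in the SHAN entry, the sketch below illustrates that generic multi-step pattern in PyTorch; the mean pooling, the soft-attention form, and the equal weighting of the three alignment levels are illustrative assumptions, not any one paper's actual model.

```python
import torch
import torch.nn.functional as F

def step_wise_alignment(img_tokens, txt_tokens):
    """Generic multi-step alignment score in the spirit of SHAN:
    1) local-to-local: each text token attends over image regions;
    2) global-to-local: the global text vector attends over image regions;
    3) global-to-global: cosine similarity of the pooled vectors.
    All design choices here are illustrative assumptions."""
    img_tokens = F.normalize(img_tokens, dim=-1)   # (R, d) region features
    txt_tokens = F.normalize(txt_tokens, dim=-1)   # (L, d) word features
    img_global = F.normalize(img_tokens.mean(0), dim=-1)
    txt_global = F.normalize(txt_tokens.mean(0), dim=-1)

    # 1) local-to-local: cross-attention of words over regions
    attn = (txt_tokens @ img_tokens.t()).softmax(dim=-1)   # (L, R)
    attended = attn @ img_tokens                           # (L, d)
    local_local = F.cosine_similarity(txt_tokens, attended, dim=-1).mean()

    # 2) global-to-local: the sentence vector attends over regions
    g_attn = (img_tokens @ txt_global).softmax(dim=0)      # (R,)
    pooled = (g_attn[:, None] * img_tokens).sum(0)         # (d,)
    global_local = F.cosine_similarity(txt_global, pooled, dim=0)

    # 3) global-to-global: pooled image vs. pooled text
    global_global = F.cosine_similarity(img_global, txt_global, dim=0)

    return (local_local + global_local + global_global) / 3
```

The point of the decomposition is that the three scores fail in different ways: token-level attention catches fine mismatches that pooled vectors blur, while the global terms keep the ranking stable when individual tokens are noisy.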
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.