Related papers: Two-stream Hierarchical Similarity Reasoning for Image-text Matching

URL: http://arxiv.org/abs/2203.05349v1
Date: Thu, 10 Mar 2022 12:56:10 GMT
Title: Two-stream Hierarchical Similarity Reasoning for Image-text Matching
Authors: Ran Chen, Hanli Wang, Lei Wang, Sam Kwong
Abstract summary: A hierarchical similarity reasoning module is proposed to automatically extract context information. Previous approaches only consider learning single-stream similarity alignment. A two-stream architecture is developed to decompose image-text matching into image-to-text level and text-to-image level similarity computation.
Score: 66.43071159630006
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reasoning-based approaches have demonstrated their powerful ability for the task of image-text matching. In this work, two issues are addressed for image-text matching. First, for reasoning processing, conventional approaches have no ability to find and use multi-level hierarchical similarity information. To solve this problem, a hierarchical similarity reasoning module is proposed to automatically extract context information, which is then co-exploited with local interaction information for efficient reasoning. Second, previous approaches only consider learning single-stream similarity alignment (i.e., image-to-text level or text-to-image level), which is inadequate to fully use similarity information for image-text matching. To address this issue, a two-stream architecture is developed to decompose image-text matching into image-to-text level and text-to-image level similarity computation. These two issues are investigated by a unifying framework that is trained in an end-to-end manner, namely two-stream hierarchical similarity reasoning network. The extensive experiments performed on the two benchmark datasets of MSCOCO and Flickr30K show the superiority of the proposed approach as compared to existing state-of-the-art methods.
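
Illustrative sketch (not from the paper): the abstract describes splitting image-text matching into an image-to-text stream and a text-to-image stream. The Python below shows one simple way such a two-stream score could be computed. It uses plain cross-attention over cosine similarities in place of the paper's hierarchical similarity reasoning module, which the abstract does not specify. The function names, the temperature value, and the equal-weight averaging of the two streams are all assumptions made for illustration.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stream_similarity(queries, contexts, temperature=9.0):
    """One stream: each query fragment attends over all context fragments,
    then is scored (cosine) against its attended context. The stream score
    is the mean over query fragments. The temperature is a hypothetical
    hyperparameter, not a value taken from the paper."""
    unit_queries = [normalize(q) for q in queries]
    unit_contexts = [normalize(c) for c in contexts]
    scores = []
    for q in unit_queries:
        weights = softmax([temperature * dot(q, c) for c in unit_contexts])
        attended = [
            sum(w * c[k] for w, c in zip(weights, unit_contexts))
            for k in range(len(q))
        ]
        scores.append(dot(q, normalize(attended)))
    return sum(scores) / len(scores)


def two_stream_similarity(regions, words):
    """Combine the image-to-text and text-to-image streams.
    Equal weighting of the two streams is an assumption."""
    i2t = stream_similarity(regions, words)  # image regions query the words
    t2i = stream_similarity(words, regions)  # words query the image regions
    return 0.5 * (i2t + t2i), i2t, t2i
```

Here regions and words stand for lists of equal-length feature vectors (for example, detected-region features and word embeddings projected into a shared space). The two stream scores generally differ, because attention is computed from a different side in each direction, which is the kind of asymmetry a single-stream model would not capture.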

Related papers

FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL [78.59912944698992]
We propose FocusDiff to enhance fine-grained text-image semantic alignment. We construct a new dataset of paired texts and images with similar overall expressions but distinct local semantics. Our approach achieves state-of-the-art performance on existing text-to-image benchmarks and significantly outperforms prior methods on PairComp.
arXiv Detail & Related papers (2025-06-05T18:36:33Z)
Descriptive Image-Text Matching with Graded Contextual Similarity [41.10869519062159]
We propose descriptive image-text matching, called DITM, to learn the graded contextual similarity between image and text. We formulate the descriptiveness score of each sentence with cumulative term frequency-inverse document frequency (TF-IDF) to balance the pairwise similarity. Our method leverages sentence descriptiveness to learn robust image-text matching in two key ways.
arXiv Detail & Related papers (2025-05-15T06:21:00Z)
Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges. Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts. Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training [33.78990448307792]
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language. Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words. In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
arXiv Detail & Related papers (2023-06-15T00:19:13Z)
Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching. We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
Step-Wise Hierarchical Alignment Network for Image-Text Matching [29.07229472373576]
We propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process. Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and global-to-global alignment at the context level, performed sequentially.
arXiv Detail & Related papers (2021-06-11T17:05:56Z)
Cross-modal Image Retrieval with Deep Mutual Information Maximization [14.778158582349137]
We study cross-modal image retrieval, where the inputs contain a source image plus some text that describes certain modifications to this image and the desired image. Our method narrows the modality gap between the text modality and the image modality by maximizing mutual information between their representations, which are not exactly semantically identical.
arXiv Detail & Related papers (2021-03-10T13:08:09Z)
Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language. Most existing approaches only rely on the image-text instance pair to learn their representations. We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
Transformer Reasoning Network for Image-Text Matching and Retrieval [14.238818604272751]
We consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. We introduce the Transformer Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer. TERN is able to separately reason on the two different modalities and to enforce a final common abstract concept space.
arXiv Detail & Related papers (2020-04-20T09:09:01Z)
Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image. We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically by recurrent neural networks (RNN). Our model achieves state-of-the-art performance on the Flickr30K dataset and competitive performance on the MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.