Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval
- URL: http://arxiv.org/abs/2312.01745v1
- Date: Mon, 4 Dec 2023 09:10:24 GMT
- Title: Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval
- Authors: Dixuan Lin, Yixing Peng, Jingke Meng, Wei-Shi Zheng
- Abstract summary: We show the discrepancy between image-to-text association and text-to-image association.
We propose CADA: Cross-Modal Adaptive Dual Association, which builds fine-grained bidirectional image-text associations.
- Score: 32.793170116202475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image person re-identification (ReID) aims to retrieve images of a
person based on a given textual description. The key challenge is to learn the
relations between detailed information from visual and textual modalities.
Existing works focus on learning a latent space to narrow the modality gap and
further build local correspondences between two modalities. However, these
methods assume that image-to-text and text-to-image associations are
modality-agnostic, resulting in suboptimal associations. In this work, we show
the discrepancy between image-to-text association and text-to-image association
and propose CADA: Cross-Modal Adaptive Dual Association, which builds
fine-grained bidirectional image-text associations. Our approach features a
decoder-based adaptive dual association module that enables full interaction
between visual and textual modalities, allowing for bidirectional and adaptive
cross-modal correspondences. Specifically, we propose a
bidirectional association mechanism: Association of text Tokens to image
Patches (ATP) and Association of image Regions to text Attributes (ARA). We
model ATP adaptively, since aggregating cross-modal features according to
mistaken associations leads to feature distortion. For
modeling the ARA, since the attributes are typically the first distinguishing
cues of a person, we propose to explore the attribute-level association by
predicting the masked text phrase using the related image region. Finally, we
learn the dual associations between texts and images, and the experimental
results demonstrate the superiority of our dual formulation. Codes will be made
publicly available.
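
To make the two association directions concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a cross-attention block that adaptively gates how much aggregated patch information each text token absorbs (ATP), and a masked-phrase prediction head driven by image regions (ARA). All module names, dimensions, and the gating design here are illustrative assumptions inferred from the abstract, not the authors' released code (which the abstract says will be made available).

```python
# Illustrative sketch only; names and design details are assumptions, not CADA's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ATPBlock(nn.Module):
    """Association of text Tokens to image Patches (ATP), sketched.

    Each text token attends over image patches; a learned gate down-weights the
    aggregated visual feature when the association looks unreliable, so mistaken
    associations do not distort the token features.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, T, D), image_patches: (B, P, D)
        attended, _ = self.cross_attn(text_tokens, image_patches, image_patches)
        # Per-token confidence gate (an assumed stand-in for the paper's adaptive design).
        g = torch.sigmoid(self.gate(torch.cat([text_tokens, attended], dim=-1)))
        return self.norm(text_tokens + g * attended)


class ARAHead(nn.Module):
    """Association of image Regions to text Attributes (ARA), sketched.

    Predicts masked attribute tokens from region features, i.e. masked-phrase
    prediction conditioned on the image.
    """

    def __init__(self, dim: int = 512, vocab_size: int = 30522, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, masked_text: torch.Tensor, image_regions: torch.Tensor) -> torch.Tensor:
        # masked_text: (B, T, D) with attribute positions replaced by a [MASK] embedding
        attended, _ = self.cross_attn(masked_text, image_regions, image_regions)
        return self.classifier(attended)  # (B, T, vocab_size) logits


if __name__ == "__main__":
    B, T, P, D = 2, 16, 49, 512
    text, patches = torch.randn(B, T, D), torch.randn(B, P, D)
    fused = ATPBlock(D)(text, patches)   # (2, 16, 512)
    logits = ARAHead(D)(text, patches)   # (2, 16, 30522)
    labels = torch.randint(0, 30522, (B, T))
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    print(fused.shape, logits.shape, loss.item())
```

In a full system, such modules would presumably sit on top of pretrained image and text encoders, with the ARA loss applied only at the masked attribute positions.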
Related papers
- EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning [38.30565103892611]
In this paper, we work towards the Entity-centric Image-Text Matching (EITM) problem.
The challenge of this task mainly lies in the larger semantic gap in entity association modeling.
We devise a multimodal attentive contrastive learning framework to address the EITM problem, developing a model named EntityCLIP.
arXiv Detail & Related papers (2024-10-23T12:12:56Z)
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- Dual Relation Alignment for Composed Image Retrieval [24.812654620141778]
We argue for the existence of two types of relations in composed image retrieval.
The explicit relation pertains to the correspondence between the reference image plus complementary text and the target image.
We propose a new framework for composed image retrieval, termed dual relation alignment.
arXiv Detail & Related papers (2023-09-05T12:16:14Z)
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval [29.884153827619915]
We present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework.
It learns relations between local visual-textual tokens and enhances global image-text matching.
The proposed method achieves new state-of-the-art results on all three public datasets.
arXiv Detail & Related papers (2023-03-22T12:11:59Z)
- Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval [142.047662926209]
We propose a novel framework for paired data augmentation by uncovering the hidden semantic information of StyleGAN2 model.
We generate augmented text through random token replacement, then pass the augmented text into the latent space alignment module.
We evaluate the efficacy of our augmented data approach on two public cross-modal retrieval datasets.
arXiv Detail & Related papers (2022-07-29T01:21:54Z)
- Two-stream Hierarchical Similarity Reasoning for Image-text Matching [66.43071159630006]
Previous approaches only consider learning single-stream similarity alignment.
A two-stream architecture is developed to decompose image-text matching into image-to-text level and text-to-image level similarity computation (a minimal sketch of this two-direction decomposition appears after this list).
A hierarchical similarity reasoning module is proposed to automatically extract context information.
arXiv Detail & Related papers (2022-03-10T12:56:10Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
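
The forward reference in the Two-stream Hierarchical Similarity Reasoning entry above points here: a minimal sketch of computing image-to-text and text-to-image similarity streams and combining them. The max-over-context pooling and the equal-weight average are illustrative assumptions, not that paper's exact hierarchical reasoning.

```python
# Illustrative sketch of two-direction image-text similarity; not the paper's exact method.
import torch
import torch.nn.functional as F


def directional_similarity(query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
    """Match each query feature to its best context feature and average the scores.

    query:   (Nq, D) features of one modality (e.g., text tokens)
    context: (Nc, D) features of the other modality (e.g., image regions)
    """
    q = F.normalize(query, dim=-1)
    c = F.normalize(context, dim=-1)
    sims = q @ c.t()                      # (Nq, Nc) cosine similarities
    return sims.max(dim=1).values.mean()


def two_stream_similarity(text_tokens: torch.Tensor, image_regions: torch.Tensor) -> torch.Tensor:
    # text-to-image stream: ground each word in the most relevant region;
    # image-to-text stream: ground each region in the most relevant word.
    t2i = directional_similarity(text_tokens, image_regions)
    i2t = directional_similarity(image_regions, text_tokens)
    return 0.5 * (t2i + i2t)              # simple average of the two streams


if __name__ == "__main__":
    tokens, regions = torch.randn(12, 256), torch.randn(36, 256)
    print(two_stream_similarity(tokens, regions).item())
```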
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.