Learning Granularity-Unified Representations for Text-to-Image Person
Re-identification
- URL: http://arxiv.org/abs/2207.07802v1
- Date: Sat, 16 Jul 2022 01:26:10 GMT
- Title: Learning Granularity-Unified Representations for Text-to-Image Person
Re-identification
- Authors: Zhiyin Shao, Xinyu Zhang, Meng Fang, Zhifeng Lin, Jian Wang, Changxing
Ding
- Abstract summary: Text-to-image person re-identification (ReID) aims to search for pedestrian images of a target identity via textual descriptions.
Existing works usually ignore the difference in feature granularity between the two modalities.
We propose an end-to-end framework based on transformers to learn granularity-unified representations for both modalities, denoted as LGUR.
- Score: 29.04254233799353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image person re-identification (ReID) aims to search for pedestrian
images of a target identity via textual descriptions. It is challenging
due to both rich intra-modal variations and significant inter-modal gaps.
Existing works usually ignore the difference in feature granularity between the
two modalities, i.e., visual features are usually fine-grained while textual
features are coarse, and this mismatch is largely responsible for the
inter-modal gap. In this paper, we propose an end-to-end framework based on
transformers to learn granularity-unified representations for both modalities,
denoted as LGUR. The LGUR framework contains two modules: a Dictionary-based
Granularity Alignment (DGA) module and a Prototype-based Granularity
Unification (PGU) module. In DGA, in order to align the granularities of two
modalities, we introduce a Multi-modality Shared Dictionary (MSD) to
reconstruct both visual and textual features. In addition, DGA incorporates two
key designs, i.e., cross-modality guidance and foreground-centric
reconstruction, to facilitate the optimization of the MSD. In PGU, we adopt a set
of shared and learnable prototypes as the queries to extract diverse and
semantically aligned features for both modalities in the granularity-unified
feature space, which further promotes the ReID performance. Comprehensive
experiments show that our LGUR consistently outperforms state-of-the-art methods
by large margins on both the CUHK-PEDES and ICFG-PEDES datasets. Code will be released
at https://github.com/ZhiyinShao-H/LGUR.
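
The abstract describes the two LGUR modules only at a high level. The PyTorch sketch below illustrates one plausible reading of them, not the authors' released implementation: the MSD is modeled as a set of learnable dictionary atoms that modality tokens query via cross-attention (DGA), and PGU uses shared learnable prototypes as attention queries to pool aligned part features. All class names, dimensions (dim=384, num_atoms=400, num_prototypes=6) and the use of nn.MultiheadAttention are assumptions for illustration; the cross-modality guidance and foreground-centric reconstruction mentioned in the abstract are omitted.

```python
# Minimal sketch of the LGUR modules as described in the abstract (assumptions only).
import torch
import torch.nn as nn


class MSDReconstruction(nn.Module):
    """Dictionary-based Granularity Alignment (DGA), sketched.

    A Multi-modality Shared Dictionary (MSD) of learnable atoms is shared by
    both modalities; each visual/textual token is reconstructed as an
    attention-weighted combination of the atoms, so both modalities are
    expressed at the same dictionary-defined granularity.
    """

    def __init__(self, dim: int = 384, num_atoms: int = 400, num_heads: int = 4):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(num_atoms, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) features from either the image or the text branch.
        atoms = self.dictionary.unsqueeze(0).expand(tokens.size(0), -1, -1)
        # Query with modality tokens, reconstruct them from the shared atoms.
        recon, _ = self.attn(query=tokens, key=atoms, value=atoms)
        return recon


class PGU(nn.Module):
    """Prototype-based Granularity Unification (PGU), sketched.

    A set of shared, learnable prototypes acts as queries that pool diverse,
    semantically aligned features from either modality in the
    granularity-unified space.
    """

    def __init__(self, dim: int = 384, num_prototypes: int = 6, num_heads: int = 4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        queries = self.prototypes.unsqueeze(0).expand(tokens.size(0), -1, -1)
        parts, _ = self.attn(query=queries, key=tokens, value=tokens)
        return parts  # (B, num_prototypes, dim): one feature per prototype


if __name__ == "__main__":
    dga, pgu = MSDReconstruction(), PGU()
    visual = torch.randn(2, 48, 384)   # e.g. image patch tokens
    textual = torch.randn(2, 64, 384)  # e.g. word tokens
    v, t = pgu(dga(visual)), pgu(dga(textual))
    print(v.shape, t.shape)  # both (2, 6, 384): a shared granularity for matching
```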
Related papers
- Revisiting the Integration of Convolution and Attention for Vision Backbone [59.50256661158862]
Convolutions and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones.
We propose in this work to use MHSAs and Convs in parallel at different granularity levels instead.
We empirically verify the potential of the proposed integration scheme, named GLMix: by offloading the burden of fine-grained features to lightweight Convs, it is sufficient to use MHSAs in a few semantic slots.
arXiv Detail & Related papers (2024-11-21T18:59:08Z) - Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z) - LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition [28.136662420053568]
Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types and their corresponding visual regions.
We propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as a connecting bridge.
arXiv Detail & Related papers (2024-02-15T14:54:33Z) - A Dual-way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking [17.847936914174543]
Multimodal Entity Linking (MEL) aims at linking ambiguous mentions with multimodal information to entities in a Knowledge Graph (KG) such as Wikipedia.
We formulate multimodal entity linking as a neural text matching problem where each piece of multimodal information (text and image) is treated as a query.
This paper introduces a dual-way enhanced (DWE) framework for MEL.
arXiv Detail & Related papers (2023-12-19T03:15:50Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction [13.454953507205278]
Multi-Modal Relation Extraction aims at identifying the relation between two entities in texts that contain visual clues.
We propose a novel MMRE framework to better capture the deeper correlations of text, entity pair, and image/objects.
Our approach achieves excellent performance compared to strong competitors, even in the few-shot situation.
arXiv Detail & Related papers (2023-06-19T15:31:34Z) - A Dual Semantic-Aware Recurrent Global-Adaptive Network For
Vision-and-Language Navigation [3.809880620207714]
Vision-and-Language Navigation (VLN) is a realistic but challenging task that requires an agent to locate the target region using verbal and visual cues.
This work proposes a dual semantic-aware recurrent global-adaptive network (DSRG) to address the above problems.
arXiv Detail & Related papers (2023-05-05T15:06:08Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to achieve learning better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that best match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z) - Dual-path CNN with Max Gated block for Text-Based Person
Re-identification [6.1534388046236765]
A novel Dual-path CNN with Max Gated block (DCMG) is proposed to extract discriminative word embeddings.
The framework is based on two deep residual CNNs jointly optimized with cross-modal projection matching.
Our approach achieves a rank-1 score of 55.81% and outperforms the state-of-the-art method by 1.3%.
arXiv Detail & Related papers (2020-09-20T03:33:29Z) - AlignSeg: Feature-Aligned Segmentation Networks [109.94809725745499]
We propose Feature-Aligned Networks (AlignSeg) to address misalignment issues during the feature aggregation process.
Our network achieves new state-of-the-art mIoU scores of 82.6% and 45.95% on two benchmark datasets, respectively.
arXiv Detail & Related papers (2020-02-24T10:00:58Z)