AXM-Net: Cross-Modal Context Sharing Attention Network for Person Re-ID
- URL: http://arxiv.org/abs/2101.08238v2
- Date: Fri, 19 Mar 2021 15:28:49 GMT
- Title: AXM-Net: Cross-Modal Context Sharing Attention Network for Person Re-ID
- Authors: Ammarah Farooq, Muhammad Awais, Josef Kittler, Syed Safwan Khalid
- Abstract summary: Cross-modal person re-identification (Re-ID) is critical for modern video surveillance systems.
The key challenge is to align inter-modality representations according to the semantic information present for a person while ignoring background information.
We present AXM-Net, a novel CNN-based architecture designed for learning semantically aligned visual and textual representations.
- Score: 20.700750237972155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal person re-identification (Re-ID) is critical for modern video
surveillance systems. The key challenge is to align inter-modality
representations according to the semantic information present for a person
while ignoring background information. In this work, we present AXM-Net, a
novel CNN-based architecture designed for learning semantically aligned visual and
textual representations. The underlying building block consists of multiple
streams of feature maps coming from visual and textual modalities and a novel
learnable context sharing semantic alignment network. We also propose
complementary intra-modal attention learning mechanisms to focus on more
fine-grained local details in the features along with a cross-modal affinity
loss for robust feature matching. Our design is unique in its ability to
implicitly learn feature alignments from data. The entire AXM-Net can be
trained in an end-to-end manner. We report results on both person search and
cross-modal Re-ID tasks. Extensive experimentation validates the proposed
framework and demonstrates its superiority by outperforming the current
state-of-the-art methods by a significant margin.
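The abstract describes the architecture only at a high level. Purely as an illustration of the two stated ideas, a context sharing block that re-weights each modality's feature stream using pooled cross-modal context, and a cross-modal affinity loss that pulls matched image/text embeddings together, here is a minimal PyTorch sketch. Every name, dimension, and gating choice below is an assumption made for illustration; it is not the paper's actual AXM block or loss definition.
```python
# Minimal sketch of a cross-modal context sharing block and an affinity
# loss in the spirit of AXM-Net (PyTorch). All module names, dimensions,
# and the gating scheme are assumptions; the published design is more
# elaborate (multiple streams and a learnable alignment network).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalContextBlock(nn.Module):
    """Re-weights visual and textual feature streams with a context vector
    pooled from both modalities (illustrative, not the exact AXM block)."""

    def __init__(self, vis_dim: int, txt_dim: int, ctx_dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, ctx_dim)  # pooled visual -> shared space
        self.txt_proj = nn.Linear(txt_dim, ctx_dim)  # pooled textual -> shared space
        self.vis_gate = nn.Linear(ctx_dim, vis_dim)  # context -> visual channel gates
        self.txt_gate = nn.Linear(ctx_dim, txt_dim)  # context -> textual channel gates

    def forward(self, vis_feat, txt_feat):
        # vis_feat: (B, C_v, H, W) CNN feature map; txt_feat: (B, T, C_t) token features.
        vis_ctx = self.vis_proj(vis_feat.mean(dim=(2, 3)))  # global average pool
        txt_ctx = self.txt_proj(txt_feat.mean(dim=1))       # mean over tokens
        shared = F.relu(vis_ctx + txt_ctx)                  # shared cross-modal context
        vis_attn = torch.sigmoid(self.vis_gate(shared))[:, :, None, None]
        txt_attn = torch.sigmoid(self.txt_gate(shared))[:, None, :]
        return vis_feat * vis_attn, txt_feat * txt_attn     # re-weighted streams

def cross_modal_affinity_loss(vis_emb, txt_emb, labels, margin=0.3):
    """Hinge loss: a matched image/text pair should score higher than any
    cross-identity pair by at least `margin` (an assumed loss form; the
    paper defines its own affinity loss)."""
    sim = F.normalize(vis_emb, dim=1) @ F.normalize(txt_emb, dim=1).t()  # (B, B)
    pos = sim.diagonal().unsqueeze(1)                      # matched-pair similarities
    neg_mask = labels.unsqueeze(1) != labels.unsqueeze(0)  # cross-identity pairs only
    return F.relu(margin - pos + sim)[neg_mask].mean()

# Shape check with random tensors:
block = CrossModalContextBlock(vis_dim=512, txt_dim=300)
v, t = block(torch.randn(4, 512, 24, 8), torch.randn(4, 20, 300))
emb_v, emb_t = torch.randn(4, 256), torch.randn(4, 256)
loss = cross_modal_affinity_loss(emb_v, emb_t, labels=torch.tensor([0, 1, 2, 3]))
```
Because the gating is driven by a context vector built from both modalities, each stream's channels are implicitly re-weighted by what the other modality considers salient, which is the intuition behind learning feature alignments directly from data.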
Related papers
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representations separately without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
arXiv Detail & Related papers (2024-09-12T11:54:25Z)
- More Pictures Say More: Visual Intersection Network for Open Set Object Detection [4.206612461069489]
We introduce a strong DETR-based model, the Visual Intersection Network for Open Set Object Detection (VINO).
VINO constructs a multi-image visual bank to preserve the semantic intersections of each category across all time steps.
Our approach guarantees a more precise alignment between target category semantics and region semantics, while significantly reducing pre-training time and resource demands.
arXiv Detail & Related papers (2024-08-26T05:52:35Z)
- X-VILA: Cross-Modality Alignment for Large Language Model [91.96081978952283]
X-VILA is an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities.
We propose a visual alignment mechanism with a visual embedding highway module to address the problem of visual information loss.
X-VILA exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins.
arXiv Detail & Related papers (2024-05-29T17:59:58Z)
- CLIP-Driven Semantic Discovery Network for Visible-Infrared Person Re-Identification [39.262536758248245]
Cross-modality identity matching poses significant challenges in VIReID.
We propose a CLIP-Driven Semantic Discovery Network (CSDN) that consists of Modality-specific Prompt Learner, Semantic Information Integration, and High-level Semantic Embedding.
arXiv Detail & Related papers (2024-01-11T10:20:13Z)
- Learning Cross-modality Information Bottleneck Representation for Heterogeneous Person Re-Identification [61.49219876388174]
Visible-Infrared person re-identification (VI-ReID) is an important and challenging task in intelligent video surveillance.
Existing methods mainly focus on learning a shared feature space to reduce the modality discrepancy between visible and infrared modalities.
We present a novel mutual information and modality consensus network, namely CMInfoNet, to extract modality-invariant identity features.
arXiv Detail & Related papers (2023-08-29T06:55:42Z)
- Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification [11.562980171753162]
Generalizable person re-identification (Re-ID) is an active research topic in machine learning and computer vision.
Previous methods mainly focus on visual representation learning, while neglecting the potential of semantic features during training.
We propose a Multi-Modal Equivalent Transformer called MMET for more robust visual-semantic embedding learning.
arXiv Detail & Related papers (2023-04-19T08:37:25Z)
- Feature Disentanglement Learning with Switching and Aggregation for Video-based Person Re-Identification [9.068045610800667]
In video person re-identification (Re-ID), the network must consistently extract features of the target person from successive frames.
Existing methods tend to focus only on how to use temporal information, which often leads to networks being fooled by similar appearances and identical backgrounds.
We propose a Disentanglement and Switching and Aggregation Network (DSANet), which separates identity-related features from camera-characteristic features and pays more attention to ID information.
arXiv Detail & Related papers (2022-12-16T04:27:56Z)
- Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning [22.464424641734652]
Cross-modal alignment is essential for vision-language pre-training models.
We propose a novel Semantic Completion Learning task to facilitate global-to-local alignment.
We also present a flexible vision encoder, which enables our model to perform image-text and video-text multimodal tasks simultaneously.
arXiv Detail & Related papers (2022-11-24T06:39:16Z)
- CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
TIReID aims to retrieve the image corresponding to the given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- IAUnet: Global Context-Aware Feature Learning for Person Re-Identification [106.50534744965955]
The IAU block enables features to incorporate global spatial, temporal, and channel context.
It is lightweight, end-to-end trainable, and can be easily plugged into existing CNNs to form IAUnet (a rough sketch of this kind of plug-in context block follows the list).
Experiments show that IAUnet performs favorably against state-of-the-art on both image and video reID tasks.
arXiv Detail & Related papers (2020-09-02T13:07:10Z)
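As referenced in the IAUnet entry above, the following is a minimal sketch of a plug-in global-context block of that general kind: squeeze-and-excitation-style channel gating plus a learned spatial attention map. It is an assumption-laden illustration of injecting global channel and spatial context into a CNN feature map, not the published IAU block (which also models temporal context for video).
```python
# Illustrative plug-in context block (PyTorch). This is NOT the published
# IAU block; it only sketches the general idea of adding global channel
# and spatial context to an existing CNN feature map.
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel context: squeeze-and-excitation-style gating.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial context: a single-channel attention map over H x W.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_gate(x)          # re-weight channels globally
        return x + x * self.spatial_gate(x)   # residual spatial re-weighting

# Shape-preserving, so it can sit between any two stages of a backbone:
out = GlobalContextBlock(256)(torch.randn(2, 256, 32, 16))
```
Because the block preserves input shape, it can be inserted between existing backbone stages without retraining from scratch, which matches the entry's claim of an easily pluggable, end-to-end trainable design.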