Signal: Selective Interaction and Global-local Alignment for Multi-Modal Object Re-Identification
- URL: http://arxiv.org/abs/2511.17965v1
- Date: Sat, 22 Nov 2025 07:58:46 GMT
- Title: Signal: Selective Interaction and Global-local Alignment for Multi-Modal Object Re-Identification
- Authors: Yangyang Liu, Yuhao Wang, Pingping Zhang
- Abstract summary: Multi-modal object Re-IDentification (ReID) is devoted to retrieving specific objects through the exploitation of complementary multi-modal image information. We propose a novel selective interaction and global-local alignment framework called Signal for multi-modal object ReID.
- Score: 43.774470057037526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal object Re-IDentification (ReID) is devoted to retrieving specific objects through the exploitation of complementary multi-modal image information. Existing methods mainly concentrate on the fusion of multi-modal features while neglecting background interference. Besides, current multi-modal fusion methods often focus on aligning modality pairs but struggle to achieve consistent alignment across all modalities. To address these issues, we propose a novel selective interaction and global-local alignment framework called Signal for multi-modal object ReID. Specifically, we first propose a Selective Interaction Module (SIM) to select important patch tokens based on intra-modal and inter-modal information. These selected patch tokens then interact with the class tokens, yielding more discriminative features. Next, we propose a Global Alignment Module (GAM) to simultaneously align multi-modal features by minimizing the volume of 3D polyhedra in the Gramian space. Meanwhile, we propose a Local Alignment Module (LAM) to align local features in a shift-aware manner. With these modules, our framework can extract more discriminative features for object ReID. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100, MSVR310) validate the effectiveness of our method. The source code is available at https://github.com/010129/Signal.
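To make the abstract's two central mechanisms concrete, below is a minimal, illustrative PyTorch sketch, not the authors' released code (see the GitHub link above for that). It assumes three modality features (RGB, NIR, TIR, as in RGBNT201), uses class-token similarity as an assumed stand-in ranking criterion for SIM's token selection, and implements the Gram-volume idea behind GAM: for L2-normalized features f_1, f_2, f_3, the spanned parallelepiped has volume sqrt(det G) with G_ij = <f_i, f_j>, which vanishes exactly when the three features are jointly aligned.

```python
import torch
import torch.nn.functional as F

def select_top_tokens(patch_tokens, cls_token, k=16):
    # Stand-in for SIM's token selection: rank patch tokens by similarity
    # to the class token and keep the top-k "important" tokens. The actual
    # SIM scores tokens with both intra-modal and inter-modal information.
    # patch_tokens: (B, N, D), cls_token: (B, 1, D)
    scores = F.cosine_similarity(patch_tokens, cls_token, dim=-1)   # (B, N)
    idx = scores.topk(k, dim=1).indices                             # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
    return patch_tokens.gather(1, idx)                              # (B, k, D)

def gram_volume_loss(f_rgb, f_nir, f_tir, eps=1e-6):
    # One reading of GAM: treat the three normalized modality features as
    # edges of a parallelepiped and minimize its volume sqrt(det(G)),
    # where G is the 3x3 Gram matrix of pairwise inner products.
    feats = torch.stack(
        [F.normalize(f, dim=-1) for f in (f_rgb, f_nir, f_tir)], dim=1
    )                                                # (B, 3, D)
    gram = feats @ feats.transpose(1, 2)             # (B, 3, 3)
    vol_sq = torch.linalg.det(gram).clamp(min=0.0)   # guard against fp noise
    return torch.sqrt(vol_sq + eps).mean()

# Usage sketch: align three modality embeddings of dimension 512.
if __name__ == "__main__":
    f_rgb, f_nir, f_tir = (torch.randn(8, 512) for _ in range(3))
    print(gram_volume_loss(f_rgb, f_nir, f_tir))     # scalar loss
```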
Related papers
- A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition [3.4568313440884837]
We present the Anchor-based Multimodal Embedding with Semantic Synchronization (A-MESS) framework. We first design an Anchor-based Multimodal Embedding (A-ME) module that employs an anchor-based embedding fusion mechanism to integrate multimodal inputs. We then develop a Semantic Synchronization (SS) strategy with a Triplet Contrastive Learning pipeline, which optimizes the process by synchronizing multimodal representations with label descriptions.
arXiv Detail & Related papers (2025-03-25T09:09:30Z)
- IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification [60.38841251693781]
We propose a novel framework to generate robust features for multi-modal object ReID. Our framework uses Modal Prefixes and InverseNet to integrate multi-modal information with semantic guidance from inverted text. Experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2025-03-13T13:00:31Z)
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [79.58755811919366]
This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing. It is designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
- DeMo: Decoupled Feature-Based Mixture of Experts for Multi-Modal Object Re-Identification [25.781336502845395]
Multi-modal object Re-IDentification aims to retrieve specific objects by combining complementary information from multiple modalities. We propose a novel feature learning framework called DeMo for multi-modal object ReID, which adaptively balances decoupled features using a mixture of experts (a generic sketch of this mechanism follows after this list).
arXiv Detail & Related papers (2024-12-14T02:36:56Z)
- Multi-level Matching Network for Multimodal Entity Linking [28.069585532270985]
Multimodal entity linking (MEL) aims to link ambiguous mentions within multimodal contexts to corresponding entities in a multimodal knowledge base. We propose a Multi-level Matching network for Multimodal Entity Linking (M3EL). M3EL is composed of three different modules: (i) a Multimodal Feature Extraction module, which extracts modality-specific representations with a multimodal encoder; (ii) an Intra-modal Matching Network module, which contains two levels of matching granularity; and (iii) a Cross-modal Matching Network module, which applies bidirectional strategies, Textual-to-Visual and Visual-to-Textual matching.
arXiv Detail & Related papers (2024-12-11T10:26:17Z)
- Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID.
Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z)
- UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving [47.590099762244535]
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM$^2$AE.
arXiv Detail & Related papers (2023-08-21T02:13:40Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Weakly Aligned Feature Fusion for Multimodal Object Detection [52.15436349488198]
Multimodal data often suffer from the position shift problem, i.e., the image pair is not strictly aligned.
This problem makes it difficult to fuse multimodal features and complicates convolutional neural network (CNN) training.
In this article, we propose a general multimodal detector named aligned region CNN (AR-CNN) to tackle the position shift problem.
arXiv Detail & Related papers (2022-04-21T02:35:23Z)
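As a complement to the DeMo entry above, here is a hedged, generic sketch of the mixture-of-experts pattern its summary refers to: a gating network predicts per-sample weights over modality-specific experts, and the fused feature is their weighted combination. All names and the gating design are illustrative assumptions, not DeMo's actual implementation.

```python
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    # Generic mixture of experts over decoupled modality features: a gate
    # scores each modality expert per sample, and the fused feature is the
    # softmax-weighted sum of the expert outputs. Illustrative only.
    def __init__(self, dim, num_modalities=3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_modalities)]
        )
        self.gate = nn.Linear(dim * num_modalities, num_modalities)

    def forward(self, feats):                       # feats: (B, M, D)
        b, m, d = feats.shape
        weights = self.gate(feats.reshape(b, m * d)).softmax(dim=-1)  # (B, M)
        expert_out = torch.stack(
            [e(feats[:, i]) for i, e in enumerate(self.experts)], dim=1
        )                                           # (B, M, D)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)       # (B, D)

# Usage sketch: fuse three modality features of dimension 512.
if __name__ == "__main__":
    moe = ModalityMoE(dim=512, num_modalities=3)
    fused = moe(torch.randn(8, 3, 512))
    print(fused.shape)  # torch.Size([8, 512])
```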