DeMo: Decoupled Feature-Based Mixture of Experts for Multi-Modal Object Re-Identification
- URL: http://arxiv.org/abs/2412.10650v1
- Date: Sat, 14 Dec 2024 02:36:56 GMT
- Title: DeMo: Decoupled Feature-Based Mixture of Experts for Multi-Modal Object Re-Identification
- Authors: Yuhao Wang, Yang Liu, Aihua Zheng, Pingping Zhang,
- Abstract summary: Multi-modal object ReIDentification aims to retrieve specific objects by combining complementary information from multiple modalities.
We propose a novel feature learning framework called DeMo for multi-modal object ReID, which adaptively balances decoupled features using a mixture of experts.
- Score: 25.781336502845395
- License:
- Abstract: Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by combining complementary information from multiple modalities. Existing multi-modal object ReID methods primarily focus on the fusion of heterogeneous features. However, they often overlook the dynamic quality changes in multi-modal imaging. In addition, the shared information between different modalities can weaken modality-specific information. To address these issues, we propose a novel feature learning framework called DeMo for multi-modal object ReID, which adaptively balances decoupled features using a mixture of experts. To be specific, we first deploy a Patch-Integrated Feature Extractor (PIFE) to extract multi-granularity and multi-modal features. Then, we introduce a Hierarchical Decoupling Module (HDM) to decouple multi-modal features into non-overlapping forms, preserving the modality uniqueness and increasing the feature diversity. Finally, we propose an Attention-Triggered Mixture of Experts (ATMoE), which replaces traditional gating with dynamic attention weights derived from decoupled features. With these modules, our DeMo can generate more robust multi-modal features. Extensive experiments on three multi-modal object ReID benchmarks fully verify the effectiveness of our methods. The source code is available at https://github.com/924973292/DeMo.
Related papers
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing.
It is designed to accurately detect horizontal or oriented objects from any sensor modality.
This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z) - MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt [60.10555128510744]
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities.
Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks.
We introduce a novel framework called MambaPro for multi-modal object ReID.
arXiv Detail & Related papers (2024-12-14T06:33:53Z) - Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding [51.96911650437978]
Multi-modal fusion has played a vital role in multi-modal scene understanding.
Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion.
We propose a relational Part-Whole Fusion (PWRF) framework for multi-modal scene understanding.
arXiv Detail & Related papers (2024-10-19T02:27:30Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantics.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named textbfEDITOR to select diverse tokens from vision Transformers for multi-modal object ReID.
Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z) - Multi-modal Semantic Understanding with Contrastive Cross-modal Feature
Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Deep Multi-Modal Sets [29.983311598563542]
Deep Multi-Modal Sets is a technique that represents a collection of features as an unordered set rather than one long ever-growing fixed-size vector.
We demonstrate a scalable, multi-modal framework that reasons over different modalities to learn various types of tasks.
arXiv Detail & Related papers (2020-03-03T15:48:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.