IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification
- URL: http://arxiv.org/abs/2503.10324v1
- Date: Thu, 13 Mar 2025 13:00:31 GMT
- Title: IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification
- Authors: Yuhao Wang, Yongfeng Lv, Pingping Zhang, Huchuan Lu,
- Abstract summary: We propose a novel framework to generate robust multi-modal object ReIDs.<n>Our framework uses Modal Prefixes and InverseNet to integrate multi-modal information with semantic guidance from inverted text.<n>Experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
- Score: 60.38841251693781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary information from various modalities. However, existing methods focus on fusing heterogeneous visual features, neglecting the potential benefits of text-based semantic information. To address this issue, we first construct three text-enhanced multi-modal object ReID benchmarks. To be specific, we propose a standardized multi-modal caption generation pipeline for structured and concise text annotations with Multi-modal Large Language Models (MLLMs). Besides, current methods often directly aggregate multi-modal information without selecting representative local features, leading to redundancy and high complexity. To address the above issues, we introduce IDEA, a novel feature learning framework comprising the Inverted Multi-modal Feature Extractor (IMFE) and Cooperative Deformable Aggregation (CDA). The IMFE utilizes Modal Prefixes and an InverseNet to integrate multi-modal information with semantic guidance from inverted text. The CDA adaptively generates sampling positions, enabling the model to focus on the interplay between global features and discriminative local features. With the constructed benchmarks and the proposed modules, our framework can generate more robust multi-modal features under complex scenarios. Extensive experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
Related papers
- A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition [3.4568313440884837]
We present the Anchor-based Multimodal Embedding with Semantic Synchronization (A-MESS) framework.
We first design an Anchor-based Multimodal Embedding (A-ME) module that employs an anchor-based embedding fusion mechanism to integrate multimodal inputs.
We develop a Semantic Synchronization (SS) strategy with the Triplet Contrastive Learning pipeline, which optimize the process by synchronizing multimodal representation with label descriptions.
arXiv Detail & Related papers (2025-03-25T09:09:30Z) - Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.30364248231053]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M2RAG)<n>M2RAG is a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models (MLLMs)<n>To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT)
arXiv Detail & Related papers (2025-02-24T16:25:25Z) - Cross-domain Few-shot Object Detection with Multi-modal Textual Enrichment [21.36633828492347]
Cross-Domain Multi-Modal Few-Shot Object Detection (CDMM-FSOD)<n>We introduce a meta-learning-based framework designed to leverage rich textual semantics as an auxiliary modality to achieve effective domain adaptation.<n>We evaluate the proposed method on common cross-domain object detection benchmarks and demonstrate that it significantly surpasses existing few-shot object detection approaches.
arXiv Detail & Related papers (2025-02-23T06:59:22Z) - MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt [60.10555128510744]
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities.<n>Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks.<n>We introduce a novel framework called MambaPro for multi-modal object ReID.
arXiv Detail & Related papers (2024-12-14T06:33:53Z) - CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model [9.224965304457708]
This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search framework.<n> Evaluations on a multimodal Q&A dataset and a public safety benchmark demonstrate that CUE-M outperforms baselines in accuracy, knowledge integration, and safety.
arXiv Detail & Related papers (2024-11-19T07:16:48Z) - Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.<n>We introduce a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.<n>We propose a simple yet effective Test-time Adaptive Cross-modal (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - Cross-domain Multi-modal Few-shot Object Detection via Rich Text [21.36633828492347]
Cross-modal feature extraction and integration have led to steady performance improvements in few-shot learning tasks.<n>We study the Cross-Domain few-shot generalization of MM-OD (CDMM-FSOD) and propose a meta-learning based multi-modal few-shot object detection method.
arXiv Detail & Related papers (2024-03-24T15:10:22Z) - Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named textbfEDITOR to select diverse tokens from vision Transformers for multi-modal object ReID.
Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network $textbf(MIMIC)$ framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.