Related papers: GraFT: Gradual Fusion Transformer for Multimodal Re-Identification

URL: http://arxiv.org/abs/2310.16856v1
Date: Wed, 25 Oct 2023 00:15:40 GMT
Title: GraFT: Gradual Fusion Transformer for Multimodal Re-Identification
Authors: Haoli Yin, Jiayao Li (Emily), Eva Schiller, Luke McDermott, Daniel Cummings
Abstract summary: We introduce the Gradual Fusion Transformer (GraFT) for multimodal ReID. GraFT employs learnable fusion tokens that guide self-attention across encoders, adeptly capturing both modality-specific and object-specific features. We demonstrate these enhancements through extensive ablation studies and show that GraFT consistently surpasses established multimodal ReID benchmarks.
Score: 0.8999666725996975
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Object Re-Identification (ReID) is pivotal in computer vision, witnessing an escalating demand for adept multimodal representation learning. Current models, although promising, reveal scalability limitations with increasing modalities as they rely heavily on late fusion, which postpones the integration of specific modality insights. Addressing this, we introduce the Gradual Fusion Transformer (GraFT) for multimodal ReID. At its core, GraFT employs learnable fusion tokens that guide self-attention across encoders, adeptly capturing both modality-specific and object-specific features. Further bolstering its efficacy, we introduce a novel training paradigm combined with an augmented triplet loss, optimizing the ReID feature embedding space. We demonstrate these enhancements through extensive ablation studies and show that GraFT consistently surpasses established multimodal ReID benchmarks. Additionally, aiming for deployment versatility, we've integrated neural network pruning into GraFT, offering a balance between model size and performance.
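
Illustrative note: the abstract mentions an augmented triplet loss but does not define it here. As a rough point of reference only, the sketch below shows the standard margin-based triplet loss that such variants typically build on. It is not the authors' loss; the function names, the Euclidean distance, and the margin value of 0.3 are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss (an assumption, not GraFT's augmented variant).

    Penalizes the embedding when the anchor is not closer to the positive
    (same identity) than to the negative (different identity) by at least
    `margin`.
    """
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    same_id = [0.0, 1.0]    # hypothetical embedding of the same object
    other_id = [3.0, 4.0]   # hypothetical embedding of a different object
    print(triplet_loss(anchor, same_id, other_id))
    print(triplet_loss(anchor, other_id, same_id))
```

A well-separated triplet contributes zero loss, while a triplet whose negative sits closer to the anchor than the positive contributes a positive penalty that training would push down.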

Related papers

Gated Recursive Fusion: A Stateful Approach to Scalable Multimodal Transformers [0.0]
Gated Recurrent Fusion (GRF) is a novel architecture that captures the power of cross-modal attention within a linearly scalable, recurrent pipeline. Our work presents a robust and efficient paradigm for powerful, scalable multimodal representation learning.
arXiv Detail & Related papers (2025-07-01T09:33:38Z)
M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification [23.322598623627222]
M$^3$amba is a novel end-to-end CLIP-driven Mamba model for multi-modal fusion. We introduce CLIP-driven modality-specific adapters to achieve a comprehensive semantic understanding of different modalities. Experiments have shown that M$^3$amba has an average performance improvement of at least 5.98% compared with the state-of-the-art methods.
arXiv Detail & Related papers (2025-03-09T05:06:47Z)
Unity is Strength: Unifying Convolutional and Transformeral Features for Better Person Re-Identification [60.9670254833103]
Person Re-identification (ReID) aims to retrieve the specific person across non-overlapping cameras. We propose a novel fusion framework called FusionReID to unify the strengths of CNNs and Transformers for image-based person ReID.
arXiv Detail & Related papers (2024-12-23T03:19:19Z)
MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution [14.265237560766268]
A flexible integration of attention across diverse spatial extents can yield significant performance enhancements. We introduce Multi-Range Attention Transformer (MAT) tailored for Super Resolution (SR) tasks. MAT adeptly captures dependencies across various spatial ranges, improving the diversity and efficacy of its feature representations.
arXiv Detail & Related papers (2024-11-26T08:30:31Z)
StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers. We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding. Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
arXiv Detail & Related papers (2024-08-02T15:41:16Z)
Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter. We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another. Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition. It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making. Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge. MKGformer can obtain SOTA performance on four datasets of multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z)
Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact.
arXiv Detail & Related papers (2022-04-19T07:47:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.