GraFT: Gradual Fusion Transformer for Multimodal Re-Identification
- URL: http://arxiv.org/abs/2310.16856v1
- Date: Wed, 25 Oct 2023 00:15:40 GMT
- Title: GraFT: Gradual Fusion Transformer for Multimodal Re-Identification
- Authors: Haoli Yin, Jiayao Li (Emily), Eva Schiller, Luke McDermott, Daniel
Cummings
- Abstract summary: We introduce the Gradual Fusion Transformer (GraFT) for multimodal ReID.
GraFT employs learnable fusion tokens that guide self-attention across encoders, adeptly capturing both modality-specific and object-specific features.
We demonstrate these enhancements through extensive ablation studies and show that GraFT consistently surpasses established multimodal ReID benchmarks.
- Score: 0.8999666725996975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object Re-Identification (ReID) is pivotal in computer vision, witnessing an
escalating demand for adept multimodal representation learning. Current models,
although promising, reveal scalability limitations with increasing modalities
as they rely heavily on late fusion, which postpones the integration of
modality-specific insights. Addressing this, we introduce the Gradual Fusion
Transformer (GraFT) for multimodal ReID. At its core, GraFT employs
learnable fusion tokens that guide self-attention across encoders, adeptly
capturing both modality-specific and object-specific features. Further
bolstering its efficacy, we introduce a novel training paradigm combined with
an augmented triplet loss, optimizing the ReID feature embedding space. We
demonstrate these enhancements through extensive ablation studies and show that
GraFT consistently surpasses established multimodal ReID benchmarks.
Additionally, aiming for deployment versatility, we've integrated neural
network pruning into GraFT, offering a balance between model size and
performance.
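
To make the fusion-token idea concrete, here is a minimal PyTorch sketch of one plausible reading of the abstract: a shared set of learnable fusion tokens is concatenated to each modality encoder's token sequence so that self-attention can exchange modality-specific and object-specific information, and the fusion tokens are then pooled into a ReID embedding. All names, shapes, and the sequential pass over modalities are assumptions made for illustration, not the authors' released implementation; the augmented triplet loss and pruning steps are omitted.

```python
# Minimal sketch of learnable fusion tokens guiding self-attention across
# per-modality encoders. Illustrative assumptions only, not the official GraFT code.
import torch
import torch.nn as nn


class GradualFusionSketch(nn.Module):
    def __init__(self, dim=256, num_fusion_tokens=4, num_modalities=3, depth=4, heads=8):
        super().__init__()
        # One lightweight transformer encoder per modality (modality-specific features).
        self.encoders = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=depth
            )
            for _ in range(num_modalities)
        ])
        # Shared, learnable fusion tokens (object-specific features).
        self.fusion_tokens = nn.Parameter(torch.randn(1, num_fusion_tokens, dim) * 0.02)

    def forward(self, modality_tokens):
        """modality_tokens: list of tensors, each (B, N_m, dim), one per modality."""
        b = modality_tokens[0].size(0)
        fused = self.fusion_tokens.expand(b, -1, -1)
        for tokens, encoder in zip(modality_tokens, self.encoders):
            # Prepend the current fusion tokens so self-attention can exchange
            # information between them and this modality's tokens.
            x = encoder(torch.cat([fused, tokens], dim=1))
            # Carry the updated fusion tokens to the next modality (gradual fusion).
            fused = x[:, : fused.size(1)]
        # Pool the fusion tokens into a single ReID embedding.
        return fused.mean(dim=1)


# Usage: three modalities (e.g. RGB, near-infrared, thermal), 128 patch tokens each.
feats = [torch.randn(2, 128, 256) for _ in range(3)]
embedding = GradualFusionSketch()(feats)  # (2, 256)
```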
Related papers
- MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution [14.265237560766268]
A flexible integration of attention across diverse spatial extents can yield significant performance enhancements.
We introduce the Multi-Range Attention Transformer (MAT), tailored for Super-Resolution (SR) tasks.
MAT adeptly captures dependencies across various spatial ranges, improving the diversity and efficacy of its feature representations.
arXiv Detail & Related papers (2024-11-26T08:30:31Z)
- StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers.
We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding.
Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
arXiv Detail & Related papers (2024-08-02T15:41:16Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
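
As a rough illustration of what a "light feature adapter" for cross-modal transfer (as in the tracking paper above) can look like, the sketch below uses a low-rank bottleneck that projects one modality's tokens and adds them to the other's. The module name, bottleneck size, and zero-initialization are assumptions, not the paper's actual design.

```python
# Rough sketch of a light bi-directional adapter: a bottleneck that projects
# features from one modality and adds them to the other. Hypothetical names.
import torch
import torch.nn as nn


class LightAdapter(nn.Module):
    """Bottleneck adapter transferring features from a source to a target modality."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # few parameters -> "light"
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)           # start with no transfer, learn it gradually
        nn.init.zeros_(self.up.bias)

    def forward(self, source_tokens, target_tokens):
        # Inject source-modality cues into the target modality's token stream.
        return target_tokens + self.up(self.act(self.down(source_tokens)))


# Bi-directional use between RGB and an auxiliary modality (e.g. thermal):
rgb, aux = torch.randn(2, 196, 768), torch.randn(2, 196, 768)
rgb_to_aux, aux_to_rgb = LightAdapter(), LightAdapter()
aux = rgb_to_aux(rgb, aux)   # RGB -> auxiliary transfer
rgb = aux_to_rgb(aux, rgb)   # auxiliary -> RGB transfer
```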
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
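
The summary of the manipulation detection paper suggests a learnable-query mechanism; the sketch below shows the generic pattern of a small bank of learnable queries cross-attending over one modality's tokens to pool global context. It is only an assumption about the mechanism, with hypothetical names, not the paper's IMQ module.

```python
# Generic "implicit query" pattern: learnable queries cross-attend over a
# modality's tokens to aggregate global contextual cues. Assumed, not IMQ's code.
import torch
import torch.nn as nn


class ImplicitQuerySketch(nn.Module):
    def __init__(self, dim=256, num_queries=8, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        """tokens: (B, N, dim) features from a single modality."""
        q = self.queries.expand(tokens.size(0), -1, -1)
        # Queries attend over all modality tokens, pooling global context.
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled  # (B, num_queries, dim)


# One query bank per modality (e.g. image tokens and text tokens):
image_ctx = ImplicitQuerySketch()(torch.randn(2, 196, 256))
text_ctx = ImplicitQuerySketch()(torch.randn(2, 40, 256))
```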
- M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
- Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer, a hybrid transformer with multi-level fusion, achieves SOTA performance on four datasets spanning multimodal link prediction, multimodal relation extraction (RE), and multimodal named entity recognition (NER).
arXiv Detail & Related papers (2022-05-04T23:40:04Z)
- Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features.
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact.
arXiv Detail & Related papers (2022-04-19T07:47:50Z)
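
A toy version of TokenFusion's substitution idea can be sketched as scoring each token's informativeness and swapping low-scoring tokens for projected features from the other modality. The scoring head, threshold, and names below are assumptions rather than the TokenFusion implementation.

```python
# Toy sketch of token substitution: score tokens of modality A and replace
# low-scoring ones with projected features from modality B. Assumed details.
import torch
import torch.nn as nn


class TokenSubstitutionSketch(nn.Module):
    def __init__(self, dim=256, threshold=0.02):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # per-token informativeness
        self.project = nn.Linear(dim, dim)  # map modality B into modality A's space
        self.threshold = threshold          # small: only clearly uninformative tokens are replaced

    def forward(self, tokens_a, tokens_b):
        """tokens_a, tokens_b: (B, N, dim) aligned token sequences from two modalities."""
        s = self.score(tokens_a)                 # (B, N, 1)
        mask = (s < self.threshold).float()      # 1 where a token of A is "uninformative"
        # Substitute uninformative tokens of A with projected inter-modal features from B.
        return mask * self.project(tokens_b) + (1.0 - mask) * tokens_a


fused_a = TokenSubstitutionSketch()(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```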