UniCat: Crafting a Stronger Fusion Baseline for Multimodal Re-Identification
- URL: http://arxiv.org/abs/2310.18812v1
- Date: Sat, 28 Oct 2023 20:30:59 GMT
- Title: UniCat: Crafting a Stronger Fusion Baseline for Multimodal Re-Identification
- Authors: Jennifer Crawford, Haoli Yin, Luke McDermott, Daniel Cummings
- Abstract summary: We show that prevailing late-fusion techniques often produce suboptimal latent representations when compared to methods that train modalities in isolation.
We argue that this effect is largely due to the inadvertent relaxation of the training objectives on individual modalities when using fusion.
Our findings also show that unimodal concatenation (UniCat) and other late-fusion ensembles of unimodal backbones exceed the current state-of-the-art performance across several multimodal ReID benchmarks.
- Score: 0.9831489366502301
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Re-Identification (ReID) is a popular retrieval task that aims to
re-identify objects across diverse data streams, prompting many researchers to
integrate multiple modalities into a unified representation. While such fusion
promises a holistic view, our investigations shed light on potential pitfalls.
We uncover that prevailing late-fusion techniques often produce suboptimal
latent representations when compared to methods that train modalities in
isolation. We argue that this effect is largely due to the inadvertent
relaxation of the training objectives on individual modalities when using
fusion, what others have termed modality laziness. We present a nuanced
point-of-view that this relaxation can lead to certain modalities failing to
fully harness available task-relevant information, and yet, offers a protective
veil to noisy modalities, preventing them from overfitting to task-irrelevant
data. Our findings also show that unimodal concatenation (UniCat) and other
late-fusion ensembling of unimodal backbones, when paired with best-known
training techniques, exceed the current state-of-the-art performance across
several multimodal ReID benchmarks. By unveiling the double-edged sword of
"modality laziness", we motivate future research in balancing local modality
strengths with global representations.
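The recipe the abstract describes is simple enough to sketch: train one backbone per modality with its own ReID objective, then form the retrieval embedding by concatenating the unimodal embeddings rather than fusing during training. Below is a minimal PyTorch sketch of that idea under stated assumptions; the backbone choice, embedding size, loss heads, and names (`UnimodalBranch`, `unicat_embedding`) are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class UnimodalBranch(nn.Module):
    """One backbone plus embedding head, trained in isolation on a single modality."""
    def __init__(self, backbone: nn.Module, feat_dim: int, embed_dim: int, num_ids: int):
        super().__init__()
        self.backbone = backbone                          # e.g. a ResNet/ViT trunk (assumed)
        self.embed = nn.Linear(feat_dim, embed_dim)       # per-modality embedding head
        self.classifier = nn.Linear(embed_dim, num_ids)   # ID-classification head for the unimodal loss

    def forward(self, x):
        emb = self.embed(self.backbone(x))
        return emb, self.classifier(emb)                  # embedding for retrieval, logits for training


def unicat_embedding(branches, inputs):
    """Late 'fusion' by simple concatenation of independently trained unimodal embeddings."""
    with torch.no_grad():
        embs = [branch(x)[0] for branch, x in zip(branches, inputs)]
    return torch.cat(embs, dim=-1)                        # used directly for distance-based retrieval
```

Each branch would be optimized separately (e.g. an ID cross-entropy loss plus a metric loss) on its own modality, which is precisely the tight unimodal objective the abstract argues is relaxed by joint late-fusion training.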
Related papers
- Progressively Modality Freezing for Multi-Modal Entity Alignment [27.77877721548588]
We propose a novel strategy of progressive modality freezing, called PMF, that focuses on alignment-relevant features.
Notably, our approach introduces a pioneering cross-modal association loss to foster modal consistency.
Empirical evaluations across nine datasets confirm PMF's superiority.
arXiv Detail & Related papers (2024-07-23T04:22:30Z) - Multi-modal Crowd Counting via a Broker Modality [64.5356816448361]
Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images.
We propose a novel approach by introducing an auxiliary broker modality and frame the task as a triple-modal learning problem.
We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models.
arXiv Detail & Related papers (2024-07-10T10:13:11Z) - Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z) - All in One Framework for Multimodal Re-identification in the Wild [58.380708329455466]
A multimodal learning paradigm for ReID is introduced, referred to as All-in-One (AIO).
AIO harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning.
Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts.
arXiv Detail & Related papers (2024-05-08T01:04:36Z) - One-stage Modality Distillation for Incomplete Multimodal Learning [7.791488931628906]
This paper presents a one-stage modality distillation framework that unifies the privileged knowledge transfer and modality information fusion.
The proposed framework can overcome the problem of incomplete modality input in various scenes and achieve state-of-the-art performance.
arXiv Detail & Related papers (2023-09-15T07:12:27Z) - Alternative Telescopic Displacement: An Efficient Multimodal Alignment Method [3.0903319879656084]
This paper introduces an innovative approach to feature alignment that revolutionizes the fusion of multimodal information.
Our method employs a novel iterative process of telescopic displacement and expansion of feature representations across different modalities, culminating in a coherent unified representation within a shared feature space.
arXiv Detail & Related papers (2023-06-29T13:49:06Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical
Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z) - Accelerating exploration and representation learning with offline
pre-training [52.6912479800592]
We show that exploration and representation learning can be improved by separately learning two different models from a single offline dataset.
We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward can significantly improve the sample efficiency on the challenging NetHack benchmark.
arXiv Detail & Related papers (2023-03-31T18:03:30Z) - Progressive Fusion for Multimodal Integration [12.94175198001421]
We present an iterative representation refinement approach, called Progressive Fusion, which mitigates the issues with late fusion representations.
We show that our approach consistently improves performance, for instance attaining a 5% reduction in MSE and 40% improvement in robustness on multimodal time series prediction.
arXiv Detail & Related papers (2022-09-01T09:08:33Z) - On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.