Related papers: Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework

URL: http://arxiv.org/abs/2410.10663v1
Date: Mon, 14 Oct 2024 16:09:38 GMT
Title: Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework
Authors: Zhengwei Yang, Yuke Li, Qiang Sun, Basura Fernando, Heng Huang, Zheng Wang
Abstract summary: This paper introduces the Cross-modal Few-Shot Learning task, which aims to recognize instances from multiple modalities when only a few labeled examples are available. We propose a Generative Transfer Learning framework consisting of two stages: the first involves training on abundant unimodal data, and the second focuses on transfer learning to adapt to novel data. Our findings demonstrate that GTL has superior performance compared to state-of-the-art methods across four distinct multi-modal datasets.
Score: 58.362064122489166
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Most existing studies on few-shot learning focus on unimodal settings, where models are trained to generalize on unseen data using only a small number of labeled examples from the same modality. However, real-world data are inherently multi-modal, and unimodal approaches limit the practical applications of few-shot learning. To address this gap, this paper introduces the Cross-modal Few-Shot Learning (CFSL) task, which aims to recognize instances from multiple modalities when only a few labeled examples are available. This task presents additional challenges compared to classical few-shot learning due to the distinct visual characteristics and structural properties unique to each modality. To tackle these challenges, we propose a Generative Transfer Learning (GTL) framework consisting of two stages: the first stage involves training on abundant unimodal data, and the second stage focuses on transfer learning to adapt to novel data. Our GTL framework jointly estimates the latent shared concept across modalities and in-modality disturbance in both stages, while freezing the generative module during the transfer phase to maintain the stability of the learned representations and prevent overfitting to the limited multi-modal samples. Our findings demonstrate that GTL has superior performance compared to state-of-the-art methods across four distinct multi-modal datasets: Sketchy, TU-Berlin, Mask1K, and SKSF-A. Additionally, the results suggest that the model can estimate latent concepts from vast unimodal data and generalize these concepts to unseen modalities using only a limited number of available samples, much like human cognitive processes.

Related papers

Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval [30.98084422803278]
We introduce UNITE, a universal framework that tackles challenges through data curation and modality-aware training configurations. Our work provides the first comprehensive analysis of how modality-specific data properties influence downstream task performance. Our framework achieves state-of-the-art results on multiple multimodal retrieval benchmarks, outperforming existing methods by notable margins.
arXiv Detail & Related papers (2025-05-26T08:09:44Z)
Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning [7.412307614007383]
Multimodal learning models are designed to bridge different modalities, such as images and text, by learning a shared representation space. These models often exhibit a modality gap, where different modalities occupy distinct regions within the shared representation space. We identify the critical roles of mismatched data pairs and a learnable temperature parameter in causing and perpetuating the modality gap during training.
arXiv Detail & Related papers (2024-12-10T20:36:49Z)
Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods. MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections. Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z)
On the Comparison between Multi-modal and Single-modal Contrastive Learning [50.74988548106031]
We introduce a theoretical foundation for understanding the differences between multi-modal and single-modal contrastive learning. We identify the critical factor, which is the signal-to-noise ratio (SNR), that impacts the generalizability in downstream tasks of both multi-modal and single-modal contrastive learning. Our analysis provides a unified framework that can characterize the optimization and generalization of both single-modal and multi-modal contrastive learning.
arXiv Detail & Related papers (2024-11-05T06:21:17Z)
A Practitioner's Guide to Continual Multimodal Pretraining [83.63894495064855]
Multimodal foundation models serve numerous applications at the intersection of vision and language. To keep models updated, research into continual pretraining mainly explores scenarios with either infrequent, indiscriminate updates on large-scale new data, or frequent, sample-level updates. We introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements.
arXiv Detail & Related papers (2024-08-26T17:59:01Z)
Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models [6.610033827647869]
In real-world scenarios, consistently acquiring complete multimodal data presents significant challenges. This often leads to the issue of missing modalities, where data for certain modalities are absent. We propose a novel framework integrating parameter-efficient fine-tuning of unimodal pretrained models with a self-supervised joint-embedding learning method.
arXiv Detail & Related papers (2024-07-17T14:44:25Z)
All in One Framework for Multimodal Re-identification in the Wild [58.380708329455466]
A multimodal learning paradigm for ReID is introduced, referred to as All-in-One (AIO). AIO harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning. Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts.
arXiv Detail & Related papers (2024-05-08T01:04:36Z)
Beyond Unimodal Learning: The Importance of Integrating Multiple Modalities for Lifelong Learning [23.035725779568587]
We study the role and interactions of multiple modalities in mitigating forgetting in deep neural networks (DNNs). Our findings demonstrate that leveraging multiple views and complementary information from multiple modalities enables the model to learn more accurate and robust representations. We propose a method for integrating and aligning the information from different modalities by utilizing the relational structural similarities between the data points in each modality.
arXiv Detail & Related papers (2024-05-04T22:02:58Z)
Beyond DAGs: A Latent Partial Causal Model for Multimodal Learning [80.44084021062105]
We propose a novel latent partial causal model for multimodal data, featuring two latent coupled variables, connected by an undirected edge, to represent the transfer of knowledge across modalities. Under specific statistical assumptions, we establish an identifiability result, demonstrating that representations learned by multimodal contrastive learning correspond to the latent coupled variables up to a trivial transformation. Experiments on a pre-trained CLIP model show that it embodies disentangled representations, enabling few-shot learning and improving domain generalization across diverse real-world datasets.
arXiv Detail & Related papers (2024-02-09T07:18:06Z)
Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning. MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process. It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities. Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding. We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL. UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
Decoupling Common and Unique Representations for Multimodal Self-supervised Learning [22.12729786091061]
We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings through multimodal redundancy reduction, DeCUR can integrate complementary information across different modalities.
arXiv Detail & Related papers (2023-09-11T08:35:23Z)
On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [21.822251958013737]
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets.
arXiv Detail & Related papers (2023-05-02T07:15:10Z)
Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis [11.368438990334397]
We develop a self-supervised learning strategy to acquire independent unimodal supervisions. We conduct extensive experiments on three public multimodal baseline datasets. Our method achieves performance comparable to that of human-annotated unimodal labels.
arXiv Detail & Related papers (2021-02-09T14:05:02Z)
Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data. Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.