Few-shot Adaptation of Multi-modal Foundation Models: A Survey
- URL: http://arxiv.org/abs/2401.01736v2
- Date: Thu, 4 Jan 2024 13:24:48 GMT
- Title: Few-shot Adaptation of Multi-modal Foundation Models: A Survey
- Authors: Fan Liu, Tianshu Zhang, Wenwen Dai, Wenwen Cai, Xiaocong Zhou, Delong
Chen
- Abstract summary: Multi-modal (vision-language) models, such as CLIP, are replacing traditional supervised pre-training models.
In some fine-grained domains like medical imaging and remote sensing, the performance of multi-modal foundation models often leaves much to be desired.
We introduce and analyze the research advancements in few-shot adaptation methods for multi-modal models.
- Score: 9.579784268228968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal (vision-language) models, such as CLIP, are replacing traditional
supervised pre-training models (e.g., ImageNet-based pre-training) as the new
generation of visual foundation models. Having learned robust and aligned
semantic representations from billions of internet image-text pairs, these models
can be applied to various downstream tasks in a zero-shot manner. However, in
some fine-grained domains like medical imaging and remote sensing, the
performance of multi-modal foundation models often leaves much to be desired.
Consequently, many researchers have begun to explore few-shot adaptation
methods for these models, gradually deriving three main technical approaches:
1) prompt-based methods, 2) adapter-based methods, and 3) external
knowledge-based methods. Nevertheless, this rapidly developing field has
produced numerous results without a comprehensive survey to systematically
organize the research progress. Therefore, in this survey, we introduce and
analyze the research advancements in few-shot adaptation methods for
multi-modal models, summarizing commonly used datasets and experimental setups,
and comparing the results of different methods. In addition, due to the lack of
reliable theoretical support for existing methods, we derive the few-shot
adaptation generalization error bound for multi-modal models. The theorem
reveals that the generalization error of multi-modal foundation models is
constrained by three factors: domain gap, model capacity, and sample size.
Based on this, we propose three possible solutions from the following aspects:
1) adaptive domain generalization, 2) adaptive model selection, and 3) adaptive
knowledge utilization.
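To make the surveyed approach families concrete, below is a minimal, illustrative sketch of the adapter-based route: a small residual MLP (in the spirit of CLIP-Adapter) trained on top of frozen CLIP features using only a few labeled images per class. This is a sketch of the general technique, not a method from this survey; the class names, prompt template, and hyperparameters are hypothetical placeholders, and the snippet assumes PyTorch plus the OpenAI clip package's standard load/tokenize/encode interface.

```python
# Illustrative sketch of adapter-based few-shot adaptation of CLIP
# (in the spirit of CLIP-Adapter). Assumes PyTorch and the OpenAI "clip"
# package; class names and hyperparameters are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip


class Adapter(nn.Module):
    """Bottleneck MLP blended with the frozen feature via a residual ratio."""

    def __init__(self, dim: int, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )
        self.ratio = ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual blend keeps the adapted feature close to the original CLIP feature.
        return self.ratio * self.fc(x) + (1.0 - self.ratio) * x


device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone stays frozen
model.eval()

class_names = ["benign lesion", "malignant lesion"]  # hypothetical fine-grained labels
with torch.no_grad():
    text_feats = model.encode_text(
        clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    ).float()
text_feats = F.normalize(text_feats, dim=-1)

adapter = Adapter(text_feats.shape[-1]).to(device)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)


def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One update on a few-shot batch of preprocessed images (e.g., 1-16 per class)."""
    with torch.no_grad():
        img_feats = model.encode_image(images.to(device)).float()
    adapted = F.normalize(adapter(img_feats), dim=-1)
    logits = 100.0 * adapted @ text_feats.t()  # CLIP-style scaled cosine similarity
    loss = F.cross_entropy(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
In this scheme only the adapter's parameters are updated, which is why it suits the few-shot regime. Prompt-based methods instead keep both encoders fixed and learn the context tokens fed to the text encoder, while external knowledge-based methods inject retrieved or generated side information into the same frozen pipeline.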
Related papers
- Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models [43.5468667825864]
This survey provides the first comprehensive review of advances in multimodal adaptation and generalization, spanning traditional approaches to foundation models.
It covers: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models.
arXiv Detail & Related papers (2025-01-30T18:59:36Z)
- Towards Modality Generalization: A Benchmark and Prospective Analysis [56.84045461854789]
This paper introduces Modality Generalization (MG), which focuses on enabling models to generalize to unseen modalities.
We propose a comprehensive benchmark featuring multi-modal algorithms and adapt existing methods that focus on generalization.
Our work provides a foundation for advancing robust and adaptable multi-modal models, enabling them to handle unseen modalities in realistic scenarios.
arXiv Detail & Related papers (2024-12-24T08:38:35Z)
- Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework [58.362064122489166]
This paper introduces the Cross-modal Few-Shot Learning task, which aims to recognize instances from multiple modalities when only a few labeled examples are available.
We propose a Generative Transfer Learning (GTL) framework consisting of two stages: the first involves training on abundant unimodal data, and the second focuses on transfer learning to adapt to novel data.
Our findings demonstrate that GTL outperforms state-of-the-art methods across four distinct multi-modal datasets.
arXiv Detail & Related papers (2024-10-14T16:09:38Z)
- A Practitioner's Guide to Continual Multimodal Pretraining [83.63894495064855]
Multimodal foundation models serve numerous applications at the intersection of vision and language.
To keep models updated, research into continual pretraining mainly explores scenarios with either infrequent, indiscriminate updates on large-scale new data, or frequent, sample-level updates.
We introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements.
arXiv Detail & Related papers (2024-08-26T17:59:01Z)
- HEMM: Holistic Evaluation of Multimodal Foundation Models [91.60364024897653]
Multimodal foundation models can holistically process text alongside images, video, audio, and other sensory modalities.
It is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains.
arXiv Detail & Related papers (2024-07-03T18:00:48Z)
- Toward Robust Multimodal Learning using Multimodal Foundational Models [30.755818450393637]
We propose TRML, Toward Robust Multimodal Learning using Multimodal Foundational Models.
TRML employs generated virtual modalities to replace missing modalities.
We also design a semantic matching learning module to align the semantic spaces of generated and missing modalities.
arXiv Detail & Related papers (2024-01-20T04:46:43Z)
- Explore and Exploit the Diverse Knowledge in Model Zoo for Domain Generalization [40.28810906825559]
We propose a new algorithm for integrating diverse pretrained models, not limited to the strongest models, in order to achieve enhanced out-of-distribution generalization performance.
Our proposed method demonstrates state-of-the-art empirical results on a variety of datasets, thus validating the benefits of utilizing diverse knowledge.
arXiv Detail & Related papers (2023-06-05T04:58:41Z)
- Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems.
Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored.
We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z)
- MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages.
We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z)