A Foundational Multi-Modal Model for Few-Shot Learning
- URL: http://arxiv.org/abs/2508.04746v1
- Date: Wed, 06 Aug 2025 06:12:13 GMT
- Title: A Foundational Multi-Modal Model for Few-Shot Learning
- Authors: Pengtao Dang, Tingbo Guo, Sha Cao, Chi Zhang
- Abstract summary: Few-shot learning (FSL) aims to generalize models from a small number of labeled examples, typically fewer than 10 per class. We present an innovative approach to FSL by demonstrating that a Large Multi-Modal Model (LMMM) can substantially improve the generalization of FSL models. Our dataset and framework offer a unified, scalable solution that significantly lowers the barrier to applying LMMMs in data-scarce scientific domains.
- Score: 4.57727355942957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot learning (FSL) is a machine learning paradigm that aims to generalize models from a small number of labeled examples, typically fewer than 10 per class. FSL is particularly crucial in biomedical, environmental, materials, and mechanical sciences, where samples are limited and data collection is often prohibitively costly, time-consuming, or ethically constrained. In this study, we present an innovative approach to FSL by demonstrating that a Large Multi-Modal Model (LMMM), trained on a set of independent tasks spanning diverse domains, task types, and input modalities, can substantially improve the generalization of FSL models, outperforming models based on conventional meta-learning on tasks of the same type. To support this, we first constructed a Multi-Modal Model Few-shot Dataset (M3FD; over 10,000 few-shot samples), which includes 2D RGB images, 2D/3D medical scans, and tabular and time-course datasets, from which we manually curated FSL tasks such as classification. We further introduced M3F (Multi-Modal Model for Few-shot learning framework), a novel Large Multi-Modal Model framework tailored for data-constrained scientific applications. M3F supports a wide range of scientific data types through a modular pipeline. By fine-tuning the model on M3FD, M3F improves model performance, making LMMMs feasible for real-world FSL deployment. The source code is available at https://github.com/ptdang1001/M3F. To democratize access to complex FSL data and promote reproducibility, M3FD is paired with a flexible, user-friendly tool that enables efficient querying, task-specific sampling, and preprocessing. Together, our dataset and framework offer a unified, scalable solution that significantly lowers the barrier to applying LMMMs in data-scarce scientific domains.
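The curated tasks follow the standard N-way K-shot protocol: each episode samples N classes, K labeled support examples per class, and a disjoint set of query examples. The sketch below shows a minimal episode sampler of this kind; the function name and signature are illustrative assumptions, not the actual interface of the M3FD tool.

```python
import random
from collections import defaultdict

def sample_episode(samples, n_way=5, k_shot=5, n_query=5, seed=None):
    """Draw one N-way K-shot episode from (input, label) pairs.
    Hypothetical helper; not the M3FD tool's real API."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in samples:
        by_class[y].append(x)
    eligible = [c for c, xs in by_class.items() if len(xs) >= k_shot + n_query]
    classes = rng.sample(eligible, n_way)  # raises if < n_way eligible classes
    support, query = [], []
    for c in classes:
        xs = rng.sample(by_class[c], k_shot + n_query)
        support += [(x, c) for x in xs[:k_shot]]  # K labeled examples per class
        query += [(x, c) for x in xs[k_shot:]]    # held-out evaluation items
    return support, query
```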
Related papers
- Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [103.98582374569789]
Model merging aims to combine multiple expert models into a single model, thereby reducing storage and serving costs. Previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. We introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models.
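The simplest instance of this idea is linear interpolation of parameters between models that share an architecture. The sketch below shows that baseline; it is a generic illustration (with hypothetical model_a/model_b), not one of the benchmark's specific merging methods.

```python
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Linear interpolation of two matching state dicts.
    Assumes all entries are float tensors with identical keys and shapes."""
    assert sd_a.keys() == sd_b.keys()
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

# Hypothetical usage: model_a and model_b are fine-tuned copies of one base model.
# merged = merge_state_dicts(model_a.state_dict(), model_b.state_dict())
# model_a.load_state_dict(merged)
```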
arXiv Detail & Related papers (2025-05-26T12:23:14Z)
- Multi-convex Programming for Discrete Latent Factor Models Prototyping [8.322623345761961]
We propose a generic framework based on CVXPY, which allows users to specify and solve the fitting problem of a wide range of DLFMs. Our framework is flexible and inherently supports the integration of regularization terms and constraints on the DLFM parameters and latent factors.
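Such fitting problems are typically multi-convex: convex in each factor block while the others are held fixed, so they can be attacked by alternating convex solves. The CVXPY sketch below applies that pattern to a plain nonnegative two-factor model; it illustrates the alternating-minimization idea under simplified assumptions and is not the paper's actual interface.

```python
import cvxpy as cp
import numpy as np

# Toy data and a rank-4 nonnegative factorization X ~ W @ H.
rng = np.random.default_rng(0)
X = rng.random((20, 30))
k = 4
W = np.abs(rng.standard_normal((20, k)))

for _ in range(10):
    # Subproblem in H (convex for fixed W).
    H_var = cp.Variable((k, 30), nonneg=True)
    cp.Problem(cp.Minimize(cp.norm(X - W @ H_var, "fro"))).solve()
    H = H_var.value
    # Subproblem in W (convex for fixed H).
    W_var = cp.Variable((20, k), nonneg=True)
    cp.Problem(cp.Minimize(cp.norm(X - W_var @ H, "fro"))).solve()
    W = W_var.value

print(np.linalg.norm(X - W @ H, "fro"))  # residual after alternating solves
```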
arXiv Detail & Related papers (2025-04-02T07:33:54Z)
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing. It is designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
- On Domain-Adaptive Post-Training for Multimodal Large Language Models [72.67107077850939]
This paper systematically investigates domain adaptation of MLLMs via post-training. We focus on data synthesis, the training pipeline, and task evaluation. We conduct experiments in high-impact domains such as biomedicine, food, and remote sensing.
arXiv Detail & Related papers (2024-11-29T18:42:28Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress. LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset. Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
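A hedged sketch of that prioritization loop: keep a running loss per data block and spend most of each round on the highest-loss blocks, reserving a slice for random review. This is an illustrative reconstruction of the idea, not the paper's exact scheduling rule.

```python
import random

def lfr_select_blocks(block_loss: dict, budget: int, review_frac: float = 0.25,
                      rng: random.Random = random.Random(0)) -> list:
    """Pick blocks for the next round: 'focus' on the hardest (highest
    recent loss), then 'review' a random sample of the remaining blocks."""
    ranked = sorted(block_loss, key=block_loss.get, reverse=True)
    n_focus = max(1, int(budget * (1.0 - review_frac)))
    focus, rest = ranked[:n_focus], ranked[n_focus:]
    review = rng.sample(rest, min(budget - n_focus, len(rest)))
    return focus + review

# Example: blocks 'c' and 'a' are hardest; one random block is reviewed.
picks = lfr_select_blocks({"a": 3.1, "b": 1.2, "c": 4.0, "d": 0.9}, budget=3)
```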
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning [25.45278447786954]
We introduce a novel federated learning framework, named Multimodal Large Language Model Assisted Federated Learning (MLLM-LLaVA-FL). Our framework is adept at harnessing the extensive, yet previously underexploited, open-source data accessible from websites and powerful server-side computational resources.
arXiv Detail & Related papers (2024-09-09T21:04:16Z)
- Multimodal Lego: Model Merging and Fine-Tuning Across Topologies and Modalities in Biomedicine [10.774128925670183]
Multimodal Lego (MM-Lego) is a general-purpose fusion framework to turn any set of encoders into a competitive multimodal model with no or minimal fine-tuning. We show that MM-Lego can be used as a model merging method which achieves competitive performance with end-to-end fusion models without any fine-tuning.
arXiv Detail & Related papers (2024-05-30T11:14:01Z)
- Large Language Models Can Automatically Engineer Features for Few-Shot Tabular Learning [35.03338699349037]
We propose a novel in-context learning framework, FeatLLM, which employs Large Language Models as feature engineers.
FeatLLM generates high-quality rules, outperforming alternatives such as TabLLM and STUNT by a significant margin (10% on average).
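Concretely, the LLM proposes human-readable rules that are compiled into binary features for a downstream classifier. The sketch below shows that compilation step; the rules and column names here are hypothetical placeholders, not outputs from FeatLLM.

```python
# Hypothetical LLM-proposed rules for a tabular task (not FeatLLM output).
RULES = {
    "high_income": lambda row: row["income"] > 50_000,
    "young_homeowner": lambda row: row["age"] < 30 and row["owns_home"],
}

def featurize(row: dict) -> list:
    """Compile each rule into a 0/1 feature value for one table row."""
    return [int(rule(row)) for rule in RULES.values()]

print(featurize({"income": 62_000, "age": 27, "owns_home": True}))  # [1, 1]
```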
arXiv Detail & Related papers (2024-04-15T06:26:08Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
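For reference, low-rank adaptation freezes a pre-trained weight matrix and learns a small update BA with rank r much smaller than the layer dimensions. The PyTorch sketch below shows a generic LoRA-wrapped linear layer; it illustrates the underlying mechanism, not MMLoRA's specific multi-modal scheme.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (B @ A)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # only the adapter is trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # BA = 0 at init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 256))
out = layer(torch.randn(4, 512))  # shape (4, 256); equals base output at init
```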
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- MDFM: Multi-Decision Fusing Model for Few-Shot Learning [16.47647579893923]
We propose a novel method, the Multi-Decision Fusing Model (MDFM), to enhance the efficacy and robustness of the model.
We evaluate the proposed method on five benchmark datasets and achieve significant improvements of 3.4%-7.3% compared with the state of the art.
arXiv Detail & Related papers (2021-12-01T18:13:09Z)
- Federated Mutual Learning [65.46254760557073]
Federated Mutual Learning (FML) allows clients to train a generalized model collaboratively and a personalized model independently.
Experiments show that FML achieves better performance than alternatives in a typical federated learning setting.
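The "mutual" part typically means each model fits the labels while also matching the other model's softened predictions. Below is a hedged sketch of such a per-model objective in PyTorch; it illustrates the generic deep-mutual-learning loss, not FML's exact formulation.

```python
import torch
import torch.nn.functional as F

def mutual_loss(logits_self: torch.Tensor, logits_peer: torch.Tensor,
                targets: torch.Tensor, T: float = 2.0, lam: float = 0.5):
    """Cross-entropy on labels plus KL toward the peer's soft predictions."""
    ce = F.cross_entropy(logits_self, targets)
    kl = F.kl_div(F.log_softmax(logits_self / T, dim=-1),
                  F.softmax(logits_peer.detach() / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return ce + lam * kl
```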
arXiv Detail & Related papers (2020-06-27T09:35:03Z)
- Ensemble Distillation for Robust Model Fusion in Federated Learning [72.61259487233214]
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a model.
In most current training schemes, the central model is refined by averaging the server model's parameters with the updated parameters from the clients.
We propose ensemble distillation for model fusion, i.e., training the central classifier on unlabeled data using the outputs of the client models.
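A hedged sketch of one distillation step under that scheme: the client models' averaged softened predictions on an unlabeled batch serve as the teacher target for the server model. This is a generic knowledge-distillation illustration under stated assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def distill_step(server, clients, unlabeled_x, optimizer, T: float = 2.0):
    """One fusion step: match the server to the averaged client ensemble."""
    with torch.no_grad():
        # Teacher = mean of client predictive distributions (softened by T).
        teacher = torch.stack(
            [F.softmax(m(unlabeled_x) / T, dim=-1) for m in clients]
        ).mean(dim=0)
    student_logp = F.log_softmax(server(unlabeled_x) / T, dim=-1)
    loss = F.kl_div(student_logp, teacher, reduction="batchmean") * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```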
arXiv Detail & Related papers (2020-06-12T14:49:47Z)