Related papers: Specialized Foundation Models Struggle to Beat Supervised Baselines

Specialized Foundation Models Struggle to Beat Supervised Baselines

URL: http://arxiv.org/abs/2411.02796v1
Date: Tue, 05 Nov 2024 04:10:59 GMT
Title: Specialized Foundation Models Struggle to Beat Supervised Baselines
Authors: Zongzhe Xu, Ritvik Gupta, Wenduo Cheng, Alexander Shen, Junhong Shen, Ameet Talwalkar, Mikhail Khodak,
Abstract summary: We look at three modalities -- genomics, satellite imaging, and time series -- with multiple recent FMs and compare them to a standard supervised learning workflow. We find that it is consistently possible to train simple supervised models that match or even outperform the latest foundation models.
Score: 60.23386520331143
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Following its success for vision and text, the "foundation model" (FM) paradigm -- pretraining large models on massive data, then fine-tuning on target tasks -- has rapidly expanded to domains in the sciences, engineering, healthcare, and beyond. Has this achieved what the original FMs accomplished, i.e. the supplanting of traditional supervised learning in their domains? To answer we look at three modalities -- genomics, satellite imaging, and time series -- with multiple recent FMs and compare them to a standard supervised learning workflow: model development, hyperparameter tuning, and training, all using only data from the target task. Across these three specialized domains, we find that it is consistently possible to train simple supervised models -- no more complicated than a lightly modified wide ResNet or UNet -- that match or even outperform the latest foundation models. Our work demonstrates that the benefits of large-scale pretraining have yet to be realized in many specialized areas, reinforces the need to compare new FMs to strong, well-tuned baselines, and introduces two new, easy-to-use, open-source, and automated workflows for doing so.

Related papers

EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor? [8.178030486012437]
We present an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs)<n>Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused.<n>Our framework sets a new direction for building scalable and efficient RSFMs.
arXiv Detail & Related papers (2025-11-26T15:52:56Z)
Synergies between Federated Foundation Models and Smart Power Grids [8.179321682277818]
M3T Federated Foundation Models (FedFMs) enable scalable, privacy-preserving model training/fine-tuning across distributed data sources.<n>In this paper, we take one of the first steps toward introducing these models to the power systems research community.
arXiv Detail & Related papers (2025-09-20T02:00:07Z)
Seeing Further on the Shoulders of Giants: Knowledge Inheritance for Vision Foundation Models [54.517276878748305]
Vision foundation models (VFMs) are predominantly developed using data-centric methods.<n>Many open-source vision models have been pretrained on domain-specific data.<n>We present a new model-driven approach for training VFMs through joint knowledge transfer and preservation.
arXiv Detail & Related papers (2025-08-20T13:30:23Z)
DINOv3 [62.31809406012177]
Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures.<n>This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies.<n>DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks.
arXiv Detail & Related papers (2025-08-13T18:00:55Z)
Large-Small Model Collaborative Framework for Federated Continual Learning [20.05022827987955]
Continual learning (CL) for Foundation Models (FMs) is an essential yet underexplored challenge.<n>We propose the first collaborative framework in Federated Continual Learning (FCL), where lightweight local models act as a dynamic bridge.<n>Two novel components are also included: Small Model Continual Fine-tuning is for preventing small models from temporal forgetting; One-by-One Distillation performs personalized fusion of heterogeneous local knowledge on the server.
arXiv Detail & Related papers (2025-08-13T04:49:50Z)
Acquiring and Adapting Priors for Novel Tasks via Neural Meta-Architectures [2.5382095320488673]
We focus on designing architectures to enable efficient acquisition of priors when large amounts of data are unavailable.<n>We demonstrate that our hypernetwork designs can acquire more generalizable priors than standard networks when trained with Model Agnostic Meta-Learning (MAML)<n>We extend our hypernetwork framework to perform 3D segmentation on novel scenes with limited data by efficiently transferring priors from earlier viewed scenes.
arXiv Detail & Related papers (2025-07-07T18:48:09Z)
UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce bfUnistage, a unified Transformer-based framework fortemporal modeling. Our work demonstrates that a task-specific vision-text can build a generalizable model fortemporal learning. We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z)
Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning [12.728451197053321]
We propose Curriculum Reinforcement Finetuning (Curr-ReFT), a novel post-training paradigm specifically designed for small-scale vision-language models (VLMs) Curr-ReFT comprises two sequential stages: Curriculum Reinforcement Learning and Rejected Sampling-based Self-improvement. Our experiments demonstrate that models trained with Curr-ReFT paradigm achieve state-of-the-art performance across various visual tasks.
arXiv Detail & Related papers (2025-03-10T08:48:50Z)
Building 6G Radio Foundation Models with Transformer Architectures [6.70088826174291]
Foundation deep learning (DL) models are designed to learn general, robust and adaptable representations of their target modality. These models are pretrained on large, unlabeled datasets using self-supervised learning (SSL) We propose and demonstrate the effectiveness of a Vision Transformer (ViT) as a radio foundation model for spectrogram learning.
arXiv Detail & Related papers (2024-11-15T07:01:44Z)
Self-Supervised Radio Pre-training: Toward Foundational Models for Spectrogram Learning [6.1339395157466425]
Foundational deep learning (DL) models are general models, trained on diverse, diverse, and unlabelled datasets. We introduce Masked Spectrogram Modeling, a novel self-supervised learning approach for pretraining foundational DL models on radio signals.
arXiv Detail & Related papers (2024-11-14T23:56:57Z)
Continual Diffuser (CoD): Mastering Continual Offline Reinforcement Learning with Experience Rehearsal [54.93261535899478]
In real-world applications, such as robotic control of reinforcement learning, the tasks are changing, and new tasks arise in a sequential order. This situation poses the new challenge of plasticity-stability trade-off for training an agent who can adapt to task changes and retain acquired knowledge. We propose a rehearsal-based continual diffusion model, called Continual diffuser (CoD), to endow the diffuser with the capabilities of quick adaptation (plasticity) and lasting retention (stability)
arXiv Detail & Related papers (2024-09-04T08:21:47Z)
A Practitioner's Guide to Continual Multimodal Pretraining [83.63894495064855]
Multimodal foundation models serve numerous applications at the intersection of vision and language. To keep models updated, research into continual pretraining mainly explores scenarios with either infrequent, indiscriminate updates on large-scale new data, or frequent, sample-level updates. We introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements.
arXiv Detail & Related papers (2024-08-26T17:59:01Z)
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities [17.374241865041856]
We show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance. We successfully scale the training to a three billion parameter model using tens of modalities and different datasets. The resulting models and training code are open sourced at 4m.epfl.ch.
arXiv Detail & Related papers (2024-06-13T17:59:42Z)
Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities [59.02391344178202]
Vision foundation models (VFMs) serve as potent building blocks for a wide range of AI applications. The scarcity of comprehensive training data, the need for multi-sensor integration, and the diverse task-specific architectures pose significant obstacles to the development of VFMs. This paper delves into the critical challenge of forging VFMs tailored specifically for autonomous driving, while also outlining future directions.
arXiv Detail & Related papers (2024-01-16T01:57:24Z)
Learn From Model Beyond Fine-Tuning: A Survey [78.80920533793595]
Learn From Model (LFM) focuses on the research, modification, and design of foundation models (FM) based on the model interface. The study of LFM techniques can be broadly categorized into five major areas: model tuning, model distillation, model reuse, meta learning and model editing. This paper gives a comprehensive review of the current methods based on FM from the perspective of LFM.
arXiv Detail & Related papers (2023-10-12T10:20:36Z)
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey [66.18478838828231]
Multi-modal pre-trained big models have drawn more and more attention in recent years. This paper introduces the background of multi-modal pre-training by reviewing the conventional deep, pre-training works in natural language process, computer vision, and speech. Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss the MM-PTMs with a focus on data, objectives, network, and knowledge enhanced pre-training.
arXiv Detail & Related papers (2023-02-20T15:34:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.