Learning on Model Weights using Tree Experts
- URL: http://arxiv.org/abs/2410.13569v3
- Date: Tue, 03 Jun 2025 15:42:42 GMT
- Title: Learning on Model Weights using Tree Experts
- Authors: Eliahu Horwitz, Bar Cavia, Jonathan Kahana, Yedid Hoshen
- Abstract summary: Training machine learning models to infer missing documentation directly from model weights is challenging. We identify a key property of real-world models: most public models belong to a small set of Model Trees. We introduce Probing Experts (ProbeX), a theoretically motivated and lightweight method to learn from the weights of a single model layer.
- Score: 39.90685550999956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The number of publicly available models is rapidly increasing, yet most remain undocumented. Users looking for suitable models for their tasks must first determine what each model does. Training machine learning models to infer missing documentation directly from model weights is challenging, as these weights often contain significant variation unrelated to model functionality (denoted nuisance). Here, we identify a key property of real-world models: most public models belong to a small set of Model Trees, where all models within a tree are fine-tuned from a common ancestor (e.g., a foundation model). Importantly, we find that within each tree there is less nuisance variation between models. Concretely, while learning across Model Trees requires complex architectures, even a linear classifier trained on a single model layer often works within trees. While effective, these linear classifiers are computationally expensive, especially when dealing with larger models that have many parameters. To address this, we introduce Probing Experts (ProbeX), a theoretically motivated and lightweight method. Notably, ProbeX is the first probing method specifically designed to learn from the weights of a single hidden model layer. We demonstrate the effectiveness of ProbeX by predicting the categories in a model's training dataset based only on its weights. Excitingly, ProbeX can map the weights of Stable Diffusion into a weight-language embedding space, enabling model search via text, i.e., zero-shot model classification.
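The abstract's central claim, that within a single Model Tree a plain linear classifier over one layer's flattened weights can already predict properties such as training-data categories, can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' ProbeX implementation: `checkpoints`, `labels`, and `layer_name` are hypothetical inputs standing in for a collection of models fine-tuned from the same base model.

```python
# Minimal sketch (not the authors' ProbeX code): linear probing of a single
# layer's weights across models fine-tuned from one base model, i.e., models
# belonging to the same Model Tree.
# Hypothetical inputs: `checkpoints` is a list of dicts mapping layer names to
# NumPy arrays; `labels` gives a training-data category for each model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def probe_single_layer(checkpoints, labels, layer_name):
    # One flattened weight vector per model, taken from a single chosen layer.
    X = np.stack([np.asarray(ckpt[layer_name]).reshape(-1) for ckpt in checkpoints])
    y = np.asarray(labels)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )

    # Within a Model Tree, even this simple linear classifier is often
    # sufficient; across trees, the paper reports that more complex
    # architectures are needed.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)
```

Because a single flattened layer can contain millions of parameters, this baseline is computationally heavy, which is the gap the lightweight ProbeX probes described in the abstract are designed to close.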
Related papers
- Intention-Conditioned Flow Occupancy Models [69.79049994662591]
Large-scale pre-training has fundamentally changed how machine learning research is done today.
Applying this same framework to reinforcement learning is appealing because it offers compelling avenues for addressing core challenges in RL.
Recent advances in generative AI have provided new tools for modeling highly complex distributions.
arXiv Detail & Related papers (2025-06-10T15:27:46Z)
- We Should Chart an Atlas of All the World's Models [37.19719066562013]
We advocate for charting the world's model population in a unified structure we call the Model Atlas.
The Model Atlas enables applications in model forensics, meta-ML research, and model discovery.
arXiv Detail & Related papers (2025-03-13T17:59:53Z)
- Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging [23.44999968321367]
Soup-of-Experts can instantiate a model at test time for any set of domain weights with minimal computational cost and without re-training the model.
We demonstrate how our approach quickly obtains small specialized models on several language modeling tasks.
arXiv Detail & Related papers (2025-02-03T20:33:20Z)
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
- Exploring space efficiency in a tree-based linear model for extreme multi-label classification [11.18858602369985]
Extreme multi-label classification (XMC) aims to identify relevant subsets from numerous labels.
Among the various approaches for XMC, tree-based linear models are effective due to their superior efficiency and simplicity.
In this work, we conduct both theoretical and empirical analyses of the space required to store a tree model under the assumption of sparse data.
arXiv Detail & Related papers (2024-10-12T15:02:40Z)
- On the Origin of Llamas: Model Tree Heritage Recovery [39.08927346274156]
We introduce the task of Model Tree Heritage Recovery (MoTHer Recovery) for discovering Model Trees in neural networks.
Our hypothesis is that model weights encode this information; the challenge is to decode the underlying tree structure from the weights.
MoTHer recovery holds exciting long-term applications akin to indexing the internet by search engines.
arXiv Detail & Related papers (2024-05-28T17:59:51Z)
- BEND: Bagging Deep Learning Training Based on Efficient Neural Network Diffusion [56.9358325168226]
We propose a Bagging deep learning training algorithm based on Efficient Neural network Diffusion (BEND).
Our approach is simple but effective: it first uses the weights and biases of multiple trained models as inputs to train an autoencoder and a latent diffusion model.
Our proposed BEND algorithm can consistently outperform the mean and median accuracies of both the original trained model and the diffused model.
arXiv Detail & Related papers (2024-03-23T08:40:38Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Initializing Models with Larger Ones [76.41561758293055]
We introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model.
Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time.
arXiv Detail & Related papers (2023-11-30T18:58:26Z)
- Knowledge is a Region in Weight Space for Fine-tuned Language Models [48.589822853418404]
We study how the weight space and the underlying loss landscape of different models are interconnected.
We show that language models that have been finetuned on the same dataset form a tight cluster in the weight space, while models finetuned on different datasets from the same underlying task form a looser cluster.
arXiv Detail & Related papers (2023-02-09T18:59:18Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space (a minimal parameter-averaging sketch appears after this list).
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Part-Based Models Improve Adversarial Robustness [57.699029966800644]
We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks.
Our model combines a part segmentation model with a tiny classifier and is trained end-to-end to segment objects into parts and classify them.
Our experiments indicate that these models also reduce texture bias and yield better robustness against common corruptions and spurious correlations.
arXiv Detail & Related papers (2022-09-15T15:41:47Z)
- Revealing Secrets From Pre-trained Models [2.0249686991196123]
Transfer-learning has been widely adopted in many emerging deep learning algorithms.
We show that pre-trained models and fine-tuned models have significantly high similarities in weight values.
We propose a new model extraction attack that reveals the model architecture and the pre-trained model used by the black-box victim model.
arXiv Detail & Related papers (2022-07-19T20:19:03Z)
- Neural Basis Models for Interpretability [33.51591891812176]
Generalized Additive Models (GAMs) are an inherently interpretable class of models.
We propose an entirely new subfamily of GAMs that utilize basis decomposition of shape functions.
A small number of basis functions are shared among all features, and are learned jointly for a given task.
arXiv Detail & Related papers (2022-05-27T17:31:19Z)
- Transfer training from smaller language model [6.982133308738434]
We present a method that saves training time and resource cost by transforming a small, well-trained model into a large model.
We test the target model on several datasets and find that it remains comparable with the source model.
arXiv Detail & Related papers (2021-04-23T02:56:02Z)
- When Ensembling Smaller Models is More Efficient than Single Large Models [52.38997176317532]
We show that ensembles can outperform single models, achieving higher accuracy while requiring fewer total FLOPs to compute.
This suggests that output diversity in ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)
- BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models [59.95091850331499]
We propose BigNAS, an approach that challenges the conventional wisdom that post-processing of the weights is necessary to get good prediction accuracies.
Our discovered model family, BigNASModels, achieve top-1 accuracies ranging from 76.5% to 80.9%.
arXiv Detail & Related papers (2020-03-24T23:00:49Z)
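Several of the related papers above combine the weights of multiple fine-tuned models directly in parameter space (e.g., Soup-of-Experts and the dataless knowledge fusion work referenced earlier). As a minimal, hedged illustration of the simplest such operation, the sketch below performs uniform weight averaging over models that share an architecture; it is not the specific method of any paper listed here, and `state_dicts` is a hypothetical list of layer-name-to-array mappings.

```python
# Minimal sketch of parameter-space merging via uniform weight averaging.
# This illustrates the general idea only; it is not the method of any
# specific paper above. Assumes `state_dicts` is a list of dicts mapping
# layer names to NumPy arrays from models with identical architectures.
import numpy as np


def average_weights(state_dicts):
    merged = {}
    for name in state_dicts[0]:
        # Average each parameter tensor element-wise across all models.
        merged[name] = np.mean([sd[name] for sd in state_dicts], axis=0)
    return merged
```

More sophisticated merging schemes reweight parameters per layer or per domain, but the averaging above captures the core operation of fusing knowledge across models without access to their original training data.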
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.