Related papers: Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities

Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities

URL: http://arxiv.org/abs/2409.03444v1
Date: Thu, 5 Sep 2024 11:49:53 GMT
Title: Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities
Authors: Wei Lu, Rachel K. Luu, Markus J. Buehler,
Abstract summary: This work explores the effects of fine-tuning strategies on Large Language Models (LLMs) in domains such as materials science and engineering. We find that the merging of multiple fine-tuned models can lead to the emergence of capabilities that surpass the individual contributions of the parent models.
Score: 4.389938747401259
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The advancement of Large Language Models (LLMs) for domain applications in fields such as materials science and engineering depends on the development of fine-tuning strategies that adapt models for specialized, technical capabilities. In this work, we explore the effects of Continued Pretraining (CPT), Supervised Fine-Tuning (SFT), and various preference-based optimization approaches, including Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO), on fine-tuned LLM performance. Our analysis shows how these strategies influence model outcomes and reveals that the merging of multiple fine-tuned models can lead to the emergence of capabilities that surpass the individual contributions of the parent models. We find that model merging leads to new functionalities that neither parent model could achieve alone, leading to improved performance in domain-specific assessments. Experiments with different model architectures are presented, including Llama 3.1 8B and Mistral 7B models, where similar behaviors are observed. Exploring whether the results hold also for much smaller models, we use a tiny LLM with 1.7 billion parameters and show that very small LLMs do not necessarily feature emergent capabilities under model merging, suggesting that model scaling may be a key component. In open-ended yet consistent chat conversations between a human and AI models, our assessment reveals detailed insights into how different model variants perform and show that the smallest model achieves a high intelligence score across key criteria including reasoning depth, creativity, clarity, and quantitative precision. Other experiments include the development of image generation prompts based on disparate biological material design concepts, to create new microstructures, architectural concepts, and urban design based on biological materials-inspired construction principles.

Related papers

The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models [54.51795784459866]
We propose a theoretical framework of performance scaling for multi-model collaboration.<n>We show that multi-model systems follow a power-law scaling with respect to the total parameter count.<n> ensembles of heterogeneous model families achieve better performance scaling than those formed within a single model family.
arXiv Detail & Related papers (2025-12-29T09:55:12Z)
Architectural Trade-offs in Small Language Models Under Compute Constraints [0.0]
We present a systematic study of small language models under strict compute constraints.<n>We evaluate each on character-level modeling of Tiny Shakespeare and word-level modeling of Penn Treebank (PTB) and WikiText-2.<n>Our results show that attention-based models dominate per-FLOP efficiency even at small scale, while increasing depth or context can degrade performance.
arXiv Detail & Related papers (2025-12-24T01:36:50Z)
The Thinking Spectrum: An Empirical Study of Tunable Reasoning in LLMs through Model Merging [8.930191971732649]
We present a large-scale empirical study evaluating a range of model merging techniques across multiple reasoning benchmarks.<n>Our findings reveal that model merging offers an effective and controllable method for calibrating the trade-off between reasoning accuracy and token efficiency.<n>Our study provides the first comprehensive analysis of this tunable space, offering practical guidelines for creating LLMs with specific reasoning profiles.
arXiv Detail & Related papers (2025-09-26T08:12:13Z)
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models [51.817121227562964]
Large Language Models (LLMs) have delivered impressive results in language understanding, generation, reasoning, and pushes the ability boundary of multimodal models.<n> Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties.<n>The traditional transformer architecture requires substantial computations and poses significant obstacles for large-scale training and practical deployment.
arXiv Detail & Related papers (2025-08-13T14:13:46Z)
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions [65.89403417819764]
We quantify the impact of design choices on language model capabilities. By incorporating features besides model size and number of training tokens, we can achieve a relative 3-28% increase in ability to predict downstream performance.
arXiv Detail & Related papers (2025-03-05T19:46:04Z)
A Collaborative Ensemble Framework for CTR Prediction [73.59868761656317]
We propose a novel framework, Collaborative Ensemble Training Network (CETNet), to leverage multiple distinct models. Unlike naive model scaling, our approach emphasizes diversity and collaboration through collaborative learning. We validate our framework on three public datasets and a large-scale industrial dataset from Meta.
arXiv Detail & Related papers (2024-11-20T20:38:56Z)
Exploring Model Kinship for Merging Large Language Models [52.01652098827454]
We introduce model kinship, the degree of similarity or relatedness between Large Language Models. We find that there is a certain relationship between model kinship and the performance gains after model merging. We propose a new model merging strategy: Top-k Greedy Merging with Model Kinship, which can yield better performance on benchmark datasets.
arXiv Detail & Related papers (2024-10-16T14:29:29Z)
Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models [34.79589443380606]
The scaling of large language models (LLMs) is a critical research area for the efficiency and effectiveness of model training and deployment. Our work investigates the transferability and discrepancies of scaling laws between Dense Models and MoE models.
arXiv Detail & Related papers (2024-10-08T03:21:56Z)
On the Modeling Capabilities of Large Language Models for Sequential Decision Making [52.128546842746246]
Large pretrained models are showing increasingly better performance in reasoning and planning tasks. We evaluate their ability to produce decision-making policies, either directly, by generating actions, or indirectly. In environments with unfamiliar dynamics, we explore how fine-tuning LLMs with synthetic data can significantly improve their reward modeling capabilities.
arXiv Detail & Related papers (2024-10-08T03:12:57Z)
Towards Synthetic Trace Generation of Modeling Operations using In-Context Learning Approach [1.8874331450711404]
We propose a conceptual framework that combines modeling event logs, intelligent modeling assistants, and the generation of modeling operations. In particular, the architecture comprises modeling components that help the designer specify the system, record its operation within a graphical modeling environment, and automatically recommend relevant operations.
arXiv Detail & Related papers (2024-08-26T13:26:44Z)
PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis [14.298235969992877]
We introduce a comprehensive framework for perturbation response modeling in single cells.<n>Our approach includes a modular and user-friendly model development and evaluation platform.<n>We highlight the limitations of widely used models, such as mode collapse.
arXiv Detail & Related papers (2024-08-20T07:40:20Z)
The Role of Model Architecture and Scale in Predicting Molecular Properties: Insights from Fine-Tuning RoBERTa, BART, and LLaMA [0.0]
This study introduces a systematic framework to compare the efficacy of Large Language Models (LLMs) for fine-tuning across various cheminformatics tasks. We assessed three well-known models-RoBERTa, BART, and LLaMA-on their ability to predict molecular properties. We found that LLaMA-based models generally offered the lowest validation loss, suggesting their superior adaptability across tasks and scales.
arXiv Detail & Related papers (2024-05-02T02:20:12Z)
Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach [25.927323251675386]
We leverage the meta-features associated with each entity as a source of worldly knowledge and employ entity representations from the models. We propose using the consistency between these representations and the meta-features as a metric for evaluating pre-trained models. Our method's effectiveness is demonstrated across various domains, including models with relational datasets, large language models and image models.
arXiv Detail & Related papers (2024-01-02T17:08:26Z)
Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost. Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
Minimal Value-Equivalent Partial Models for Scalable and Robust Planning in Lifelong Reinforcement Learning [56.50123642237106]
Common practice in model-based reinforcement learning is to learn models that model every aspect of the agent's environment. We argue that such models are not particularly well-suited for performing scalable and robust planning in lifelong reinforcement learning scenarios. We propose new kinds of models that only model the relevant aspects of the environment, which we call "minimal value-minimal partial models"
arXiv Detail & Related papers (2023-01-24T16:40:01Z)
The Geometry of Self-supervised Learning Models and its Impact on Transfer Learning [62.601681746034956]
Self-supervised learning (SSL) has emerged as a desirable paradigm in computer vision. We propose a data-driven geometric strategy to analyze different SSL models using local neighborhoods in the feature space induced by each.
arXiv Detail & Related papers (2022-09-18T18:15:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.