Complexity-aware fine-tuning
- URL: http://arxiv.org/abs/2506.21220v1
- Date: Thu, 26 Jun 2025 13:13:24 GMT
- Title: Complexity-aware fine-tuning
- Authors: Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev
- Abstract summary: General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy.
- Score: 2.0393477576774752
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. Better results can be achieved by distilling the chain-of-thought of a larger model, at the cost of numerous expensive calls and a much greater amount of data. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy. Specifically, across two small open models ($\approx 3B$) we split the training data into complexity categories by single-token answer entropy (ROC AUC $0.73$), fine-tune the models via SFT and distillation, and show that our pipeline significantly outperforms the standard SFT approach ($0.55$ vs $0.43$ average accuracy) and provides performance comparable to distillation while using $62\%$ less data ($0.55$ average accuracy for both). We publish our code and data to facilitate further research in this direction.
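The routing step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released code: `answer_entropy` computes the Shannon entropy of a model's probability distribution over the single answer token, and `split_by_complexity` routes each example by comparing that entropy to a cutoff. The threshold value and all function names here are hypothetical; the paper selects its split via ROC AUC against actual model correctness.

```python
import math

def answer_entropy(probs):
    """Shannon entropy (in nats) of a single-token answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def split_by_complexity(examples, threshold=1.0):
    """Route examples by answer entropy.

    `examples` is a list of (example, answer_probs) pairs, where
    `answer_probs` is the model's distribution over the answer token.
    `threshold` is an illustrative cutoff, not a value from the paper.
    Returns (simple, complex): simple examples go to plain SFT,
    complex examples to chain-of-thought distillation.
    """
    simple, complex_ = [], []
    for example, probs in examples:
        if answer_entropy(probs) > threshold:
            complex_.append(example)   # model is uncertain: use reasoning
        else:
            simple.append(example)     # model is confident: plain SFT suffices
    return simple, complex_
```

In practice the answer-token distribution would come from a forward pass of the small model over each training prompt; the point of the blueprint is that only the high-entropy subset incurs the expensive chain-of-thought distillation calls.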
Related papers
- Efficient Ensemble for Fine-tuning Language Models on Multiple Datasets [17.79010397902909]
Existing methods, such as quantized LoRA, are efficient when adapting to a single dataset. We propose an ensemble of multiple smaller adapters instead of a single adapter per task. Our approach provides up to $10\%$ higher average test accuracy over QLoRA, with only $9\%$ more FLOPs.
arXiv Detail & Related papers (2025-05-28T03:27:08Z)
- Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning [18.381178799923514]
Post-training of Large Language Models often involves a pipeline of Supervised Finetuning (SFT) followed by Preference Finetuning (PFT). We study how to optimally allocate a fixed training data budget between the two stages.
arXiv Detail & Related papers (2025-02-16T21:57:35Z)
- Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
Learning conditional distributions is a central problem in machine learning. We propose a new paradigm that integrates both paired and unpaired data. We show that our approach can theoretically recover true conditional distributions with arbitrarily small error.
arXiv Detail & Related papers (2024-10-03T16:12:59Z)
- Making Multi-Axis Gaussian Graphical Models Scalable to Millions of Samples and Features [0.30723404270319693]
We introduce a method that has $O(n^2)$ runtime and $O(n)$ space complexity, without assuming independence.
We demonstrate that our approach can be used on unprecedentedly large datasets, such as a real-world 1,000,000-cell scRNA-seq dataset.
arXiv Detail & Related papers (2024-07-29T11:15:25Z)
- Exploiting Pre-trained Models for Drug Target Affinity Prediction with Nearest Neighbors [58.661454334877256]
Drug-Target binding Affinity (DTA) prediction is essential for drug discovery.
Despite the application of deep learning methods to DTA prediction, the achieved accuracy remains suboptimal.
We propose $k$NN-DTA, a non-representation embedding-based retrieval method applied to a pre-trained DTA prediction model.
arXiv Detail & Related papers (2024-07-21T15:49:05Z)
- A Specialized Semismooth Newton Method for Kernel-Based Optimal Transport [92.96250725599958]
Kernel-based optimal transport (OT) estimators offer an alternative, functional estimation procedure to address OT problems from samples.
We show that our SSN method achieves a global convergence rate of $O(1/\sqrt{k})$, and a local quadratic convergence rate under standard regularity conditions.
arXiv Detail & Related papers (2023-10-21T18:48:45Z)
- On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
arXiv Detail & Related papers (2023-06-03T05:01:51Z)
- Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models [19.464992602919015]
Fine-tuning large models is highly effective; however, inference can be expensive and produces carbon emissions.
We show that distilling from T5-XXL (11B) to T5-Small (60M) is almost always a cost-efficient strategy compared to annotating more data.
We will make our code, datasets, annotation cost estimates, and baseline models available as a benchmark to support further work on cost-efficient training of compact models.
arXiv Detail & Related papers (2023-05-02T17:56:16Z)
- Infinite Recommendation Networks: A Data-Centric Approach [8.044430277912936]
We leverage the Neural Tangent Kernel to train infinitely-wide neural networks to devise $\infty$-AE: an autoencoder with infinitely-wide bottleneck layers.
We also develop Distill-CF for synthesizing tiny, high-fidelity data summaries.
We observe 96-105% of $\infty$-AE's performance on the full dataset with as little as 0.1% of the original dataset size.
arXiv Detail & Related papers (2022-06-03T00:34:13Z)
- Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$ [118.04625413322827]
$\texttt{t5x}$ and $\texttt{seqio}$ are open source software libraries for building and training language models.
These libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data.
arXiv Detail & Related papers (2022-03-31T17:12:13Z)
- An Experimental Design Perspective on Model-Based Reinforcement Learning [73.37942845983417]
In practical applications of RL, it is expensive to observe state transitions from the environment.
We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process.
arXiv Detail & Related papers (2021-12-09T23:13:57Z)
- FANOK: Knockoffs in Linear Time [73.5154025911318]
We describe a series of algorithms that efficiently implement Gaussian model-X knockoffs to control the false discovery rate on large scale feature selection problems.
We test our methods on problems with $p$ as large as $500,000$.
arXiv Detail & Related papers (2020-06-15T21:55:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.