Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques
- URL: http://arxiv.org/abs/2506.08060v1
- Date: Mon, 09 Jun 2025 08:37:19 GMT
- Title: Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques
- Authors: Asankhaya Sharma
- Abstract summary: Large language models have transformed natural language processing, yet supervised fine-tuning (SFT) remains computationally intensive. This paper formally proves that capabilities acquired through SFT can be approximated by a base transformer model. We extend these results to practical scenarios with finite context lengths and partial dataset access.
- Score: 1.14219428942199
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have transformed natural language processing, yet supervised fine-tuning (SFT) remains computationally intensive. This paper formally proves that capabilities acquired through SFT can be approximated by a base transformer model using inference-time techniques, specifically in-context learning (ICL), without altering model parameters, under idealized assumptions including unbounded computational resources and access to the fine-tuning dataset. We extend these results to practical scenarios with finite context lengths and partial dataset access. For text generation tasks with fixed output length $l$, datasets of size $\mathrm{O}\left( \frac{m V}{\varepsilon^2} \log \frac{m}{\delta} \right)$ or, with bounded context, $\mathrm{O}\left( \frac{l \log V}{\varepsilon^2} \log \frac{1}{\delta} \right)$ suffice to approximate fine-tuned behavior across $m$ contexts within error $\varepsilon$, where $V$ is the vocabulary size and $\delta$ is the failure probability. For linear classification, datasets of size $\mathrm{O}\left( \frac{d}{\varepsilon} \right)$ or, with fixed context, $\mathrm{O}\left( \frac{1}{\varepsilon^2} \log \frac{1}{\delta} \right)$ are sufficient, where $d$ is the input dimension. Grounded in the Turing completeness of transformers, these results provide a theoretical foundation for resource-efficient deployment of large language models, with practical techniques like retrieval-augmented generation bridging theory to real-world applications.
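The dataset-size bounds in the abstract can be evaluated numerically. A minimal sketch (with the constants hidden by the big-O set to 1, so the results are orders of magnitude only; all parameter values below are hypothetical):

```python
import math

def text_gen_bound(m, V, eps, delta):
    """O((m*V/eps^2) * log(m/delta)): text generation, unbounded context."""
    return (m * V / eps**2) * math.log(m / delta)

def text_gen_bound_bounded(l, V, eps, delta):
    """O((l*log V/eps^2) * log(1/delta)): text generation, bounded context."""
    return (l * math.log(V) / eps**2) * math.log(1 / delta)

def linear_clf_bound(d, eps):
    """O(d/eps): linear classification with partial dataset access."""
    return d / eps

# Hypothetical setting: m=100 contexts, vocabulary V=50_000,
# output length l=128, input dimension d=768, eps=0.1, delta=0.01.
print(f"{text_gen_bound(100, 50_000, 0.1, 0.01):.3g}")
print(f"{text_gen_bound_bounded(128, 50_000, 0.1, 0.01):.3g}")
print(f"{linear_clf_bound(768, 0.1):.3g}")
```

With these hypothetical values the bounded-context bound is several orders of magnitude smaller than the unbounded one, which is the practical motivation for the fixed-context results.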
Related papers
- Robust learning of halfspaces under log-concave marginals [6.852292115526837]
We give an algorithm that learns linear threshold functions and returns a classifier with boundary volume $O(r+\varepsilon)$ at perturbation radius $r$. The time and sample complexity of $d^{\tilde{O}(1/\varepsilon^2)}$ matches the complexity of Boolean regression.
arXiv Detail & Related papers (2025-05-19T20:12:16Z) - Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under $(L_0, L_1)$-Smoothness [74.18546828528298]
We show that SignSGD with Majority Voting can robustly work over the whole range of the smoothness parameter $\kappa$.
arXiv Detail & Related papers (2025-02-11T19:54:11Z) - Learning Networks from Wide-Sense Stationary Stochastic Processes [7.59499154221528]
A key inference problem here is to learn edge connectivity from node outputs (potentials). We use Whittle's maximum likelihood estimator (MLE) to learn the support of $L^\ast$ from temporally correlated samples. We show that the MLE problem is strictly convex, admitting a unique solution.
arXiv Detail & Related papers (2024-12-04T23:14:00Z) - Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
Learning conditional distributions is a central problem in machine learning. We propose a new paradigm that integrates both paired and unpaired data. We show that our approach can theoretically recover true conditional distributions with arbitrarily small error.
arXiv Detail & Related papers (2024-10-03T16:12:59Z) - Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle\right)$. We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with a complexity that is not governed by information exponents.
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning [2.2984209387877628]
We study Lipschitz MDPs in the infinite-horizon average-reward reinforcement learning (RL) setup. We upper bound their regret as $\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = d^{\Phi}_z + 2$ for the model-free algorithm PZRL-MF and $d_{\text{eff.}} = 2d_{\mathcal{S}} + d^{\Phi}_z + 3$ for the model-based algorithm PZRL-MB.
arXiv Detail & Related papers (2024-05-29T06:18:09Z) - On Understanding Attention-Based In-Context Learning for Categorical Data [49.40350941996942]
We develop a network composed of attention blocks, with each block employing a self-attention layer followed by a cross-attention layer, with associated skip connections. This model can exactly perform multi-step functional gradient descent for in-context inference with categorical observations.
arXiv Detail & Related papers (2024-05-27T15:03:21Z) - Overview of Physics-Informed Machine Learning Inversion of Geophysical Data [1.5108390071719444]
We review four types of algorithms for physics-informed machine learning (PIML) inversion of geophysical data.
The selection of weights $\lambda_i$ and the NN operations determine one of four different PIML algorithms.
arXiv Detail & Related papers (2023-10-12T08:10:31Z) - Regret-Optimal Federated Transfer Learning for Kernel Regression with Applications in American Option Pricing [8.723136784230906]
We propose an optimal iterative scheme for federated transfer learning, where a central planner has access to datasets.
Our objective is to minimize the cumulative deviation of the generated parameters $\{\theta_i(t)\}_{t=0}^{T}$ across all $T$ iterations.
By leveraging symmetries within the regret-optimal algorithm, we develop a nearly regret-optimal scheme that runs with $\mathcal{O}(Np^2)$ fewer elementary operations.
arXiv Detail & Related papers (2023-09-08T19:17:03Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches.
We design a computationally efficient algorithm that achieves a near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Supervised Training of Conditional Monge Maps [107.78770597815242]
Optimal transport (OT) theory describes general principles to define and select, among many possible choices, the most efficient way to map a probability measure onto another.
We introduce CondOT, a multi-task approach to estimate a family of OT maps conditioned on a context variable.
We demonstrate the ability of CondOT to infer the effect of an arbitrary combination of genetic or therapeutic perturbations on single cells.
arXiv Detail & Related papers (2022-06-28T19:34:44Z) - A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound $\Omega(\sqrt{n/p})$ on the robustness of interpolating neural network parameters on arbitrary distributions.
We then show the potential benefit of overparametrization for smooth data when $n=\mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=\exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z) - An Algorithm for Learning Smaller Representations of Models With Scarce Data [0.0]
We present a greedy algorithm for solving binary classification problems in situations where the dataset is too small or not fully representative.
It relies on a trained model with loose accuracy constraints, an iterative hyperparameter pruning procedure, and a function used to generate new data.
arXiv Detail & Related papers (2020-10-15T19:17:51Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain a bound of $\mathcal{O}\big(1/\sqrt{tn}\big)$, which is at the order of the excess error of a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - Convergence of Sparse Variational Inference in Gaussian Processes Regression [29.636483122130027]
We show that a method with an overall computational cost of $\mathcal{O}\big(N(\log N)^{2D}(\log\log N)^{2}\big)$ can be used to perform inference.
arXiv Detail & Related papers (2020-08-01T19:23:34Z) - Few-Shot Learning via Learning the Representation, Provably [115.7367053639605]
This paper studies few-shot learning via representation learning.
One uses $T$ source tasks with $n_1$ data per task to learn a representation in order to reduce the sample complexity of a target task.
arXiv Detail & Related papers (2020-02-21T17:30:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.