Tuning Language Models by Proxy
- URL: http://arxiv.org/abs/2401.08565v4
- Date: Fri, 23 Aug 2024 05:21:44 GMT
- Title: Tuning Language Models by Proxy
- Authors: Alisa Liu, Xiaochuang Han, Yizhong Wang, Yulia Tsvetkov, Yejin Choi, Noah A. Smith
- Abstract summary: We introduce proxy-tuning, a lightweight decoding-time algorithm that operates on top of black-box LMs to achieve the same end as direct tuning.
Our method tunes a smaller LM, then applies the difference between the predictions of the small tuned and untuned LMs to shift the original predictions of the larger untuned model in the direction of tuning.
- Score: 110.49482736590907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the general capabilities of large pretrained language models, they consistently benefit from further adaptation to better achieve desired behaviors. However, tuning these models has become increasingly resource-intensive, or impossible when model weights are private. We introduce proxy-tuning, a lightweight decoding-time algorithm that operates on top of black-box LMs to achieve the same end as direct tuning, but by accessing only its predictions over the output vocabulary, not its parameters. Our method tunes a smaller LM, then applies the difference between the predictions of the small tuned and untuned LMs to shift the original predictions of the larger untuned model in the direction of tuning, while retaining the benefits of larger-scale pretraining. In experiments, when we apply proxy-tuning to Llama2-70B using proxies of only 7B size, we can close 88% of the gap between Llama2-70B and its truly-tuned chat version, when evaluated across knowledge, reasoning, and safety benchmarks. We then demonstrate the generality of proxy-tuning by applying it to domain adaptation on code, and task-specific finetuning on question-answering and math problems. Finally, we show how to proxy-tune a truly black-box LM, GPT-3.5, for temporal adaptation, increasing its knowledge about recent events. Our work demonstrates the promise of using small tuned LMs to efficiently customize large, potentially proprietary LMs through decoding-time guidance.
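At each decoding step, the method reduces to logit arithmetic: add the tuned-minus-untuned difference from the small proxies to the large model's next-token logits, then decode from the shifted distribution. Below is a minimal sketch using Hugging Face `transformers` with greedy decoding; the model names mirror the Llama2 setup from the experiments, but this is an illustrative reconstruction under the assumption of a shared vocabulary, not the authors' released code.

```python
# Minimal proxy-tuning sketch: shift a large base model's next-token
# logits by the (tuned - untuned) logit difference of small proxy LMs.
# Illustrative reconstruction, not the authors' code; greedy decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LARGE = "meta-llama/Llama-2-70b-hf"      # large untuned base model
SMALL = "meta-llama/Llama-2-7b-hf"       # small untuned "anti-expert"
TUNED = "meta-llama/Llama-2-7b-chat-hf"  # small tuned "expert"

tok = AutoTokenizer.from_pretrained(SMALL)  # all three share this vocabulary
models = {name: AutoModelForCausalLM.from_pretrained(name).eval()
          for name in (LARGE, SMALL, TUNED)}

@torch.no_grad()
def proxy_tuned_generate(prompt: str, max_new_tokens: int = 64) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Next-token logits of each model on the same prefix.
        z = {name: m(ids).logits[:, -1, :] for name, m in models.items()}
        # Proxy-tuned logits: z_large + (z_tuned - z_untuned).
        shifted = z[LARGE] + (z[TUNED] - z[SMALL])
        next_id = shifted.argmax(dim=-1, keepdim=True)  # greedy for brevity
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```

In the truly black-box setting only the first step changes: `z[LARGE]` would come from whatever per-token output distribution the API exposes rather than from local weights, which is all the method requires of the large model.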
Related papers
- MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning [11.174544614042984]
During fine-tuning, large language models (LLMs) may forget the knowledge acquired in the pre-training stage, leading to a decline in general capabilities.
We propose a new fine-tuning algorithm, the Momentum-Filtered Optimizer (MoFO).
MoFO achieves similar fine-tuning performance while keeping parameters closer to the pre-trained model.
arXiv Detail & Related papers (2024-07-30T17:38:24Z)
- Aligning Large Language Models via Fine-grained Supervision [20.35000061196631]
Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations.
Current approaches focus on using reinforcement learning with human feedback to improve model alignment.
We propose a method to enhance LLM alignment through fine-grained token-level supervision.
arXiv Detail & Related papers (2024-06-04T20:21:45Z)
- When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method [56.571951345048355]
Large language models (LLMs) often adopt finetuning to unlock their capabilities for downstream applications.
We study whether and how different scaling factors, including LLM model size, pretraining data size, new finetuning parameter size and finetuning data size, affect the finetuning performance.
arXiv Detail & Related papers (2024-02-27T04:18:49Z)
- Black-Box Tuning of Vision-Language Models with Effective Gradient Approximation [71.21346469382821]
We introduce collaborative black-box tuning (CBBT) for both textual prompt optimization and output feature adaptation for black-box models.
CBBT is extensively evaluated on eleven downstream benchmarks and achieves remarkable improvements compared to existing black-box VL adaptation methods.
arXiv Detail & Related papers (2023-12-26T06:31:28Z)
- Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zeroth-order optimizer (MeZO) that operates in place, fine-tuning LMs with the same memory footprint as inference.
arXiv Detail & Related papers (2023-05-27T02:28:10Z)
- CombLM: Adapting Black-Box Language Models through Small Fine-Tuned Models [43.28607973774104]
Methods for adapting language models (LMs) to new tasks and domains have traditionally assumed white-box access to the model.
We present a lightweight method for adapting large LMs to new domains and tasks, assuming no access to their weights or intermediate activations.
Our approach fine-tunes a small white-box LM and combines it with the large black-box LM at the probability level through a small network (a sketch of this combination appears after this list).
arXiv Detail & Related papers (2023-05-23T06:32:55Z)
- nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales [65.01417261415833]
We present an approach to predict the pre-training loss, based on the observation that Maximal Update Parametrization (μP) enables accurate fitting of scaling laws.
With around 14% of the one-time pre-training cost, we can accurately forecast the loss for models up to 52B.
Our goal with nanoLM is to empower researchers with limited resources to reach meaningful conclusions on large models.
arXiv Detail & Related papers (2023-04-14T00:45:01Z)
- Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which instead optimizes task-specific decoder networks on the output side.
With gradient-based optimization, DecT can be trained within several seconds and requires only one query to the pre-trained model per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $200\times$ speed-up.
arXiv Detail & Related papers (2022-12-16T11:15:39Z)
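For contrast with proxy-tuning's fixed logit offset, the CombLM entry above learns its combination: a small network decides how to mix the two models' next-token distributions. The following is a hedged sketch assuming both models expose probability vectors over a shared vocabulary; the gating architecture is an illustrative guess, not the paper's exact design.

```python
# CombLM-style combination sketch: a tiny learned gate mixes the
# black-box LM's and the small tuned LM's next-token probabilities.
# Illustrative architecture; both underlying LMs stay frozen.
import torch
import torch.nn as nn

class ProbCombiner(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * vocab_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # mixing weight lambda in (0, 1)
        )

    def forward(self, p_large: torch.Tensor, p_small: torch.Tensor) -> torch.Tensor:
        # p_large, p_small: (batch, vocab_size) next-token distributions.
        lam = self.gate(torch.cat([p_large, p_small], dim=-1))
        # A convex combination of two distributions is still a distribution.
        return lam * p_small + (1.0 - lam) * p_large
```

Training would minimize next-token negative log-likelihood on target-domain text while updating only the gate's parameters, so the adaptation cost is independent of the black-box model's size.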
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of any information presented here and is not responsible for any consequences arising from its use.