Abstract
Few-shot in-context learning (ICL) enables pre-trained language models to
perform a previously-unseen task without any gradient-based training by feeding
a small number of training examples as part of the input. ICL incurs
substantial computational, memory, and storage costs because it involves
processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning (e.g. adapter modules, prompt tuning, sparse
update methods, etc.) offers an alternative paradigm where a small set of
parameters are trained to enable a model to perform the new task. In this
paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning
and demonstrate that the latter offers better accuracy as well as dramatically
lower computational costs. Along the way, we introduce a new
parameter-efficient fine-tuning method called (IA)$^3$ that scales activations
by learned vectors, attaining stronger performance while only introducing a
relatively tiny number of new parameters. We also propose a simple recipe based
on the T0 model called T-Few that can be applied to new tasks without
task-specific tuning or modifications. We validate the effectiveness of T-Few
on completely unseen tasks by applying it to the RAFT benchmark, attaining
super-human performance for the first time and outperforming the
state-of-the-art by 6% absolute. All of the code used in our experiments is
publicly available.
Department of Computer Science, University of North Carolina at Chapel Hill
{haokunl,dtredsox,muqeeth,craffel}@cs.unc.edu
1 Introduction
Pre-trained language models have become a cornerstone of natural language processing, thanks to the fact that they can dramatically improve data efficiency on tasks of interest – i.e., using a pre-trained language model for initialization often produces better results with less labeled data.
A historically common approach has been to use the pre-trained model’s parameters for initialization before performing gradient-based fine-tuning on a downstream task of interest.
While fine-tuning has produced many state-of-the-art results [1], it results in a model that is specialized for a single task with an entirely new set of parameter values, which can become impractical when fine-tuning a model on many downstream tasks.
An alternative approach popularized by [3, 4] is in-context learning (ICL), which induces a model to perform a downstream task by inputting prompted examples.
Few-shot prompting is the process of converting a small collection of input-target pairs into (typically) human-understandable instructions and examples [3, 4], along with a single unlabeled example for which a prediction is desired.
Figure 1: Diagram of (IA)3 and the loss terms used in the T-Few recipe.
Left: (IA)3 introduces the learned vectors $l_k$, $l_v$, and $l_{ff}$, which respectively rescale (via element-wise multiplication, visualized as $\odot$) the keys and values in attention mechanisms and the inner activations in position-wise feed-forward networks.
Right: In addition to a standard cross-entropy loss $\mathcal{L}_{\text{LM}}$, we introduce an unlikelihood loss $\mathcal{L}_{\text{UL}}$ that lowers the probability of incorrect outputs and a length-normalized loss $\mathcal{L}_{\text{LN}}$ that applies a standard softmax cross-entropy loss to length-normalized log-probabilities of all output choices.
Despite these advantages, ICL has significant drawbacks. First, processing all of the prompted in-context examples every time a prediction is made incurs substantial computational, memory, and storage costs. Second, in-context learning typically produces inferior performance compared to fine-tuning [4].
Finally, the exact formatting of the prompt (including the wording [11] and ordering of examples [12]) can have significant and unpredictable impact on the model’s performance, far beyond inter-run variation observed when performing fine-tuning.
Recent work has also demonstrated that ICL can perform well even when provided with incorrect labels, raising questions as to how much learning is taking place at all [9].
An additional paradigm for enabling a model to perform a new task with minimal updates is parameter-efficient fine-tuning (PEFT), where a pre-trained model is fine-tuned while only updating or adding a small number of parameters rather than all of the model's parameters.
Recent methods have shown that it is possible to match the performance of fine-tuning the full model while only updating or adding a small fraction (e.g., 0.01%) of the full model's parameters [13, 14].
Furthermore, certain PEFT methods allow mixed-task batches where different examples in a batch are processed differently [14], making both PEFT and ICL viable approaches for multitask models.
While the benefits of PEFT begin to address some of the shortcomings of fine-tuning (when compared to ICL), there has been relatively little focus on whether PEFT methods work well when very little labeled data is available.
Our primary goal in this paper is to close this gap by proposing a recipe – i.e., a model, a PEFT method, and a fixed set of hyperparameters – that attains strong performance on novel, unseen tasks while only updating a tiny fraction of the model’s parameters.
Finally, we demonstrate the benefits of pre-training the (IA)3 parameters before fine-tuning [18, 19].
Our overall recipe, which we dub “T-Few”, attains significantly stronger performance than ICL (even against 16× larger models) and outperforms humans for the first time on the real-world few-shot learning benchmark RAFT [2] while requiring dramatically less compute and allowing for mixed-task batches during inference.
2 Background
In this section, we provide a brief overview of in-context learning and parameter-efficient fine-tuning, with a focus on characterizing the costs of each method.
The real-world costs can vary somewhat depending on implementation and hardware, so we characterize costs in terms of FLOPs for computation and bytes for memory and storage, respectively.
2.1 Few-shot in-context learning
In-context learning (ICL), introduced and popularized by Radford et al. [3] and Brown et al. [4], aims to induce a model to perform a task by feeding in concatenated and prompted input-target examples (called "shots") along with an unlabeled query example.
Taking the cycled letter task from Brown et al [4] as an example, a 4-shot input or context would be “Please unscramble the letters into a word, and write that word: asinoc = casino, yfrogg = froggy, plesim = simple, iggestb = biggest, astedro =”, for which the desired output would be “roasted”.
ICL induces an autoregressive language model to perform this task by feeding in the context and sampling from the model.
For classification tasks, each label is associated with a string (e.g., "positive" and "negative" for sentiment analysis) and a label is assigned by choosing the label string that the model assigns the highest probability to.
For multiple-choice tasks (e.g., choosing between N possible answers to a question), the model's prediction is similarly the choice that is assigned the highest probability.
This also enables mixed-task batches, where different examples in a batch of data can correspond to different tasks by using different contexts in the input.
ICL is also typically performed with only a limited number of labeled examples – called few-shot learning – making it a data-efficient way of enabling a model to perform a task.
Despite these advantages, ICL comes with significant practical drawbacks: First, the need for the model to process labeled examples in the context before it makes each prediction dramatically increases the computational cost compared to processing the unlabeled example alone.
Specifically, ignoring the quadratic complexity of self-attention operations in Transformer language models (which are typically small compared to the costs of the rest of the model [20]), processing the k training examples for k-shot ICL increases the computational cost by approximately k + 1 times compared to processing the unlabeled example alone.
Memory costs similarly scale approximately linearly with k, though during inference the memory costs are typically dominated by storing the model’s parameters.
Separately, there is a small amount of on-disk storage required for storing the in-context examples for a given task.
For example, storing 32 examples for a task where the prompted input and target for each example is 512 tokens long would require about 66 kilobytes of storage on disk (32 examples × 512 tokens × 32 bits).
Beyond the aforementioned costs, it has been found that ICL has unintuitive behavior.
For example, Zhao et al [12] showed that the ordering of examples in the context heavily influences the model’s predictions.
Min et al [9] showed that ICL can still perform well even if the labels of the in-context examples are swapped (i.e. made incorrect), which raises questions about whether ICL is really “learning” from the labeled examples or not.
Various approaches have been proposed to mitigate these issues.
One way to lower the computational costs of ICL is to exploit the fact that decoder-only Transformer language models have a causal masking pattern, so the model’s activations for the context do not change when the unlabeled example changes.
In an extreme case, 32-shot ICL with 512 input and target tokens per in-context example would result in over 144 gigabytes of cached key and value vectors for the GPT-3 model (32 examples × 512 tokens × 96 layers × 12288 dmodel × 32 bits each for the key and value vectors).
Storing these cached values on disk would therefore incur nontrivial storage costs.
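As a sanity check, the following snippet (a back-of-the-envelope sketch, not taken from any released codebase) reproduces the cached key and value size quoted above:

```python
# Cached key/value size for 32-shot ICL with GPT-3 175B: 32 examples x 512 tokens
# x 96 layers x d_model = 12288, stored as 32-bit floats for both keys and values.
examples, tokens, layers, d_model = 32, 512, 96, 12288
bytes_per_float = 4
cache_bytes = examples * tokens * layers * d_model * bytes_per_float * 2  # keys + values
print(cache_bytes / 2**30, "GiB")  # ~144 GiB
```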
Separately, Min et al [21] proposed ensemble ICL, where instead of using the output probability from concatenating the k training examples, the output probabilities of the model on each training example (i.e. 1-shot ICL for each of the k examples) are multiplied together.
This lowers the memory cost by a factor of k/2 but increases the computational cost by a factor of 2.
In terms of task performance, Min et al [21] find that ensemble ICL outperforms the standard concatenative variant.
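To make the ensembling procedure concrete, here is a minimal sketch of ensemble ICL. The `score_choice` argument is a hypothetical callable (not from any specific library) that returns the log-probability the language model assigns to an answer choice given a prompt:

```python
def ensemble_icl(query: str, choices: list[str], shots: list[str], score_choice) -> str:
    """Ensemble ICL: score every choice under k separate 1-shot contexts and
    multiply the resulting probabilities (equivalently, sum log-probabilities)."""
    totals = []
    for choice in choices:
        totals.append(sum(score_choice(shot + "\n" + query, choice) for shot in shots))
    # Return the choice with the highest ensembled score.
    return max(zip(totals, choices))[1]
```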
2.2 Parameter-efficient fine-tuning
While standard fine-tuning updates all parameters of the pre-trained model, it has been demonstrated that it is possible to instead update or add a relatively small number of parameters during fine-tuning.
Early methods proposed adding adapters [22–24], which are small feed-forward networks inserted between the layers in the pre-trained model whose parameters are updated during fine-tuning while the remainder of the pre-trained model is left fixed.
Since then, various sophisticated PEFT methods have been proposed, including methods that choose a sparse subset of parameters to train [25, 26], produce low-rank updates [13], perform optimization in a lower-dimensional subspace [27], add low-rank adapters using hypercomplex multiplication [28], and more.
Relatedly, prompt tuning [14] concatenates learned continuous embeddings to the model’s input to induce it to perform a task and can be seen as a PEFT method [29].
State-of-the-art PEFT methods can match the performance of fine-tuning all of the model's parameters while updating only a tiny fraction (e.g., 0.01%) of the model's parameters.
A primary advantage of PEFT is that it drastically reduces the storage requirements for fine-tuned models.
In addition, certain PEFT methods straightforwardly allow mixed-task batches – for example, prompt tuning enables a single model to perform many tasks simply by concatenating different prompt embeddings to each example in the batch [14].
On the other hand, other PEFT methods such as those that use sparse or low-rank updates do not make mixed-task batches convenient because they require a different set of parameters for each task.
Separately, different PEFT methods increase the computation and memory required to perform inference by different amounts.
For example, adapters effectively add additional (small) layers to the model, resulting in small but non-negligible increases in computational costs and memory.
An additional cost incurred by PEFT is the cost of fine-tuning itself, which must be performed once and is then amortized as the model is used for inference.
However, we will demonstrate that the increase in computation cost incurred by PEFT methods for fine-tuning and during inference is a small proportion of the inference cost required for ICL.
Additionally, we will show that PEFT can be dramatically more computationally efficient during inference while achieving better accuracy than ICL.
3 Designing the T-Few Recipe
Given that PEFT allows a model to be adapted to a new task with relatively small storage requirements and computational cost, we argue that PEFT presents a promising alternative to ICL.
Our goal is therefore to develop a recipe that allows a model to attain high accuracy on new tasks with limited labeled examples while allowing mixed-task batches during inference and incurring minimal computational and storage costs.
By recipe, we mean a specific model and hyperparameter setting that provides strong performance on any new task without manual tuning or per-task adjustments.
In this way, we can ensure that our approach is a realistic option in few-shot settings where limited labeled data is available for evaluation [30, 31].
3.1 Model and Datasets
In preliminary experiments applying PEFT methods to different pre-trained models, we attained the best performance with T0 [1].
T0 is based on T5 [15], an encoder-decoder Transformer model [32] that was pre-trained via a masked language modeling objective [33] on a large corpus of unlabeled text data.
T0 was created by fine-tuning T5 on a multitask mixture of datasets in order to enable zero-shot generalization, i.e. the ability to perform tasks without any additional gradient-based training.
Examples in the datasets used to train T0 were prompted by applying the prompt templates from the Public Pool of Prompts (P3 [34]), which convert each example in each dataset to a prompted text-to-text format where each label corresponds to a different string.
For all models and experiments, we use Hugging Face Transformers [35].
While T0 was designed for zero-shot generalization, we will demonstrate that it also attains strong performance after fine-tuning with only a few labeled examples.
To test T0’s generalization, Sanh et al [1] chose a set of tasks (and corresponding datasets) to hold out from the multitask training mixture – specifically, sentence completion (COPA [36], H-SWAG [37], and Story Cloze [38] datasets),
natural language inference (ANLI [39], CB [40], and RTE [41]), coreference resolution (WSC [42] and Winogrande [43]), and word sense disambiguation (WiC [44]).
Evaluation of generalization capabilities can then be straightforwardly done by measuring performance on these held-out datasets.
We will also later test T-Few's abilities on the RAFT benchmark [2] in section 4.3, a collection of unseen "real-world" few-shot tasks with no validation set and a held-out test set.
To ease comparison, we use the same number of few-shot training examples for each dataset as Brown et al [4], which varies from 20 to 70.
Unfortunately, the few-shot dataset subsets used by Brown et al [4] have not been publicly disclosed.
To allow for a more robust comparison, we therefore constructed five few-shot datasets by sampling subsets with different seeds and report the median and interquartile range.
We prompt examples from each dataset using the prompt templates from P3 [34], with a randomly-sampled prompt template for each example at each step.
Unless otherwise stated, we train our model for 1K steps with a batch size of 8 and report performance at the end of training.
For evaluation, we use “rank classification”, where the model’s log-probabilities for all possible label strings are ranked and the model’s prediction is considered correct if the highest-ranked choice is the correct answer.
Rank classification evaluation is compatible with both classification and multiple-choice tasks.
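As an illustration, the following sketch performs rank classification with a T0 checkpoint via Hugging Face Transformers (the checkpoint name and prompt are placeholders; summed log-probabilities are used here, while the length-normalized variant introduced in section 3.2 would divide each score by the number of target tokens):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B")

def rank_classify(prompt: str, choices: list[str]) -> str:
    """Return the answer choice whose string the model assigns the highest log-probability."""
    inputs = tokenizer(prompt, return_tensors="pt")
    scores = []
    for choice in choices:
        labels = tokenizer(choice, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(**inputs, labels=labels)
        # out.loss is the mean token-level cross-entropy, so multiplying by the
        # number of target tokens recovers the (negated) summed log-probability.
        scores.append(-out.loss.item() * labels.shape[1])
    return choices[max(range(len(choices)), key=lambda i: scores[i])]

# e.g., rank_classify("Review: great movie! Is this review positive or negative?",
#                     ["positive", "negative"])
```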
Since model performance can vary significantly depending on the prompt template used, we report the median accuracy across all prompt templates from P3 and across few-shot data subsets for each dataset.
For all tasks and datasets, we report the accuracy on the test set, or on the validation set in the event that the test labels are not public (e.g., on all SuperGLUE tasks).
In the main text, we report median accuracy across the nine datasets mentioned above.
Detailed results on each individual dataset are reported in the appendices.
3.2 Unlikelihood Training and Length Normalization
Before investigating PEFT methods, we first explore two additional loss terms to improve the performance of few-shot fine-tuning of language models.
Language models are normally trained with cross-entropy loss:
$$\mathcal{L}_{\text{LM}} = -\frac{1}{T}\sum_{t=1}^{T} \log p(y_t \mid \mathbf{x}, y_{<t}) \quad (1)$$
where the model is trained to increase the probability of the correct target sequence $\mathbf{y} = (y_1, y_2, \ldots, y_T)$ given the input sequence $\mathbf{x}$.
For evaluation, we use rank classification (described in section 3.1) which depends on both the probability that the model assigns to the correct choice as well as the probabilities assigned by the model to the incorrect choices.
To account for this, we consider adding an unlikelihood loss [16, 17]:
$$\mathcal{L}_{\text{UL}} = -\frac{\sum_{n=1}^{N}\sum_{t=1}^{T^{(n)}} \log\left(1 - p\left(\hat{y}^{(n)}_t \mid \mathbf{x}, \hat{y}^{(n)}_{<t}\right)\right)}{\sum_{n=1}^{N} T^{(n)}} \quad (2)$$
which discourages the model from predicting tokens from incorrect target sequences, where $\hat{\mathbf{y}}^{(n)} = (\hat{y}^{(n)}_1, \hat{y}^{(n)}_2, \ldots, \hat{y}^{(n)}_{T^{(n)}})$ is the $n$-th of $N$ incorrect target sequences.
We hypothesize that adding LUL will improve results on rank classification because the model will be trained to assign lower probabilities to incorrect choices, thereby improving the chance that the correct choice is ranked highest.
The possible answer choices for a given example can have significantly different lengths, and since the probability a language model assigns to a sequence shrinks as the sequence grows longer, rank classification can be biased toward shorter choices. To rectify this, we consider using length normalization when performing rank classification, which divides the model's score on each possible answer choice by the number of tokens in the choice (as used in GPT-3 [4]).
When using length normalization during evaluation, we introduce an additional loss term during training that more closely reflects length-normalized evaluation: First, we compute the length-normalized log probability of a given output sequence as
$$\beta(\mathbf{x}, \mathbf{y}) = \frac{1}{T}\sum_{t=1}^{T} \log p(y_t \mid \mathbf{x}, y_{<t}) \quad (3)$$
Then, we maximize the length-normalized log probability of the correct answer choice via a standard softmax cross-entropy loss:
$$\mathcal{L}_{\text{LN}} = -\log \frac{\exp(\beta(\mathbf{x}, \mathbf{y}))}{\exp(\beta(\mathbf{x}, \mathbf{y})) + \sum_{n=1}^{N} \exp(\beta(\mathbf{x}, \hat{\mathbf{y}}^{(n)}))} \quad (4)$$
When training a model with $\mathcal{L}_{\text{LM}}$, $\mathcal{L}_{\text{UL}}$, and $\mathcal{L}_{\text{LN}}$, we simply sum them.
This avoids introducing any hyperparameters that would be problematic to tune in the few-shot setting (where realistically-sized validation sets are tiny by necessity [30, 31]).
We report the results of fine-tuning all of T0-3B’s parameters with and without length normalization on all datasets in appendix C. We find that adding LLN improves the accuracy from 60.7% to 62.71% and including both LUL and LLN provides a further improvement to 63.3%.
Since these loss terms improve performance without introducing any additional hyperparameters, we include them in our recipe and use them in all following experiments.
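A minimal PyTorch sketch of the three loss terms follows, assuming per-token log-probabilities have already been gathered for the correct choice and for each incorrect choice (tensor names and shapes are illustrative, not taken from the released T-Few code):

```python
import torch

def t_few_loss(correct_logprobs: torch.Tensor,
               incorrect_logprobs: list[torch.Tensor]) -> torch.Tensor:
    """correct_logprobs: shape (T,), log p(y_t | x, y_<t) for the correct target.
    incorrect_logprobs: one (T_n,) tensor of token log-probabilities per incorrect choice."""
    # Eq. (1): length-averaged cross-entropy on the correct target.
    l_lm = -correct_logprobs.mean()

    # Eq. (2): unlikelihood loss, averaged over all tokens of all incorrect choices.
    unlikelihood = [torch.log1p(-lp.exp().clamp(max=1 - 1e-6)) for lp in incorrect_logprobs]
    l_ul = -torch.cat(unlikelihood).mean()

    # Eq. (3): length-normalized log-probability beta(x, y) for every choice.
    betas = torch.stack([correct_logprobs.mean()] + [lp.mean() for lp in incorrect_logprobs])

    # Eq. (4): softmax cross-entropy over the length-normalized scores; index 0 is correct.
    l_ln = -torch.log_softmax(betas, dim=0)[0]

    # The three terms are simply summed, with no extra hyperparameters.
    return l_lm + l_ul + l_ln
```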
3.3 Parameter-efficient fine-tuning with (IA)3
In order to compare favorably to few-shot ICL, we need a PEFT method that has the following properties: First, it must add or update as few parameters as possible to avoid incurring storage and memory costs. Second, it should attain strong accuracy after few-shot training on new tasks. Finally, it must allow for mixed-task batches, which is most convenient when the method does not modify the model's weights in a task-specific way.
Otherwise, each example in a batch would effectively need to be processed by a different model or computational graph.
A more convenient alternative is provided by methods that directly modify the activations of the model since this can be done independently and cheaply to each example in the batch according to which task the example corresponds to.
Prompt tuning and prefix tuning methods [14, 45] work by concatenating learned vectors to activation or embedding sequences and are therefore examples of activation-modifying PEFT methods that allow for mixed-task batches.
However, as we will discuss later, we were unable to attain reasonable accuracy with prompt tuning and found that the more performant PEFT methods did not allow for mixed-task batches.
Specifically, we consider adaptation of the form $l \odot x$, where $l \in \mathbb{R}^d$ is a learned task-specific vector, $\odot$ represents element-wise multiplication, and $x \in \mathbb{R}^{T \times d}$ is a length-$T$ sequence of activations.
We use "broadcasting notation" [46] so that the $(i, j)$-th entry of $l \odot x$ is $l_j x_{i,j}$.
In preliminary experiments, we found it was not necessary to introduce a learned rescaling vector for each set of activations in the Transformer model.
Instead, we found it was sufficient to introduce rescaling vectors on the keys and values in self-attention and encoder-decoder attention mechanisms and on the intermediate activation of the position-wise feed-forward networks.
Specifically, using the notation from Vaswani et al. [32], we introduce three learned vectors $l_k \in \mathbb{R}^{d_k}$, $l_v \in \mathbb{R}^{d_v}$, and $l_{ff} \in \mathbb{R}^{d_{ff}}$, which are introduced into the attention mechanisms as $\mathrm{softmax}\left(\frac{Q(l_k \odot K)^{T}}{\sqrt{d_k}}\right)(l_v \odot V)$ and into the position-wise feed-forward networks as $(l_{ff} \odot \gamma(W_1 x))W_2$, where $\gamma$ is the feed-forward network nonlinearity.
We introduce a separate set of lk, lv, and lff vectors in each Transformer layer block.
This adds a total of L(dk + dv + dff ) new parameters for a L-layer-block Transformer encoder and L(2dk + 2dv + dff ) (with factors of 2 accounting for the presence of both self-attention and encoder-decoder attention) for a L-layer-block decoder.
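Instantiating this count gives on the order of a million new parameters, consistent with the roughly 0.01% fraction mentioned above. The dimensions below are an assumption (the T5 v1.1 XXL configuration underlying T0) rather than values stated in this paper:

```python
# Assumed T5 v1.1 XXL dimensions: 24 encoder and 24 decoder layer blocks,
# total key/value dimension 4096, feed-forward dimension 10240.
L, d_k, d_v, d_ff = 24, 4096, 4096, 10240
encoder = L * (d_k + d_v + d_ff)           # self-attention + feed-forward rescaling vectors
decoder = L * (2 * d_k + 2 * d_v + d_ff)   # self-attention + cross-attention + feed-forward
print(encoder + decoder)                   # ~1.1M parameters, roughly 0.01% of 11B
```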
We refer to our method as (IA)3, which stands for “Infused Adapter by Inhibiting and Amplifying Inner Activations”.
(IA)3 makes mixed-task batches possible because each sequence of activations in the batch can be separately and cheaply multiplied by its associated learned task vector.
We also note that, in the event that a model will only be used on a single task, the modifications introduced by (IA)3 can also be applied to weight matrices permanently so that no elementwise multiplication is required and the model’s architecture remains unchanged.
In this case, our method incurs no additional computational cost compared to the original model.
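To make the adaptation concrete, below is a minimal single-head PyTorch sketch of how the $l_k$, $l_v$, and $l_{ff}$ vectors enter an attention block and a position-wise feed-forward network. Module names are illustrative; in practice the vectors are injected into an existing T0 checkpoint and are the only parameters that receive gradients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IA3Attention(nn.Module):
    """Single-head attention with (IA)^3 rescaling of keys and values."""
    def __init__(self, d_model: int, d_k: int, d_v: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_k)
        self.k = nn.Linear(d_model, d_k)
        self.v = nn.Linear(d_model, d_v)
        # Initialized to ones so the adapted model starts out identical to the pre-trained one.
        self.l_k = nn.Parameter(torch.ones(d_k))
        self.l_v = nn.Parameter(torch.ones(d_v))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        Q, K, V = self.q(x), self.k(x), self.v(x)
        scores = Q @ (self.l_k * K).transpose(-2, -1) / K.shape[-1] ** 0.5
        return F.softmax(scores, dim=-1) @ (self.l_v * V)

class IA3FeedForward(nn.Module):
    """Position-wise feed-forward network with (IA)^3 rescaling of the inner activation."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)
        self.l_ff = nn.Parameter(torch.ones(d_ff))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU stands in here for the model's actual feed-forward nonlinearity.
        return self.w2(self.l_ff * F.relu(self.w1(x)))
```

During fine-tuning, only `l_k`, `l_v`, and `l_ff` are trained; all pre-trained weights stay frozen.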
To validate (IA)3, we compare it to a large variety of existing adaptation methods in our setting of fine-tuning T0-3B on few-shot datasets from held-out tasks.
Specifically, we compare against eight strong baseline methods: BitFit [47] which updates only the bias parameters; Adapters [23] which introduce task-specific layers after the self-attention and position-wise feed-forward networks; Compacter and Compacter++ [28] which improve upon adapters by using low-rank matrices and hypercomplex multiplication; prompt tuning [14] which learns task-specific prompt embeddings that are concatenated to the model’s input; FISH Mask [26] which chooses a subset of parameters to update based on their approximate Fisher information; Intrinsic SAID [27] which performs optimization in a low-dimensional subspace; and LoRA [13] which assigns low-rank updates to parameter matrices.
Additionally, we include the simple baselines of full-model fine-tuning and updating only the layer normalization parameters.
For certain methods that allow changing the number of parameters updated or added, we report results for different parameter budgets: 0.2% and 0.02% sparsity for FISH Mask, 10 and 100 learned prompt vectors for prompt tuning, and 20,000- or 500,000-dimensional subspaces for Intrinsic SAID.
The results are summarized in fig. 2, with detailed per-dataset results in appendix D. We find that (IA)3 is the only method that attains higher accuracy than the full-model-fine-tuning baseline.
Our results and setting differ from some past work on the PEFT methods we compare against.
Mahabadi et al [28] report that Compacter and Compacter++ outperform full-model fine-tuning, including in the few-shot setting.
Lester et al [14] found that prompt tuning could match full-model fine-tuning, and in subsequent work Wei et al [48] found that prompt tuning performed well when applied to a multitask fine-tuned model in the few-shot setting.
In both cases, we experimented with various hyperparameter choices to try to match past results.
We hypothesize the disagreement comes from us using a different model and different datasets.
For prompt tuning specifically, we noticed that the validation set performance could fluctuate wildly over the course of training, hinting at possible optimization issues.
Figure 2: Accuracy of different parameter-efficient methods when applied to few-shot fine-tuning of T0-3B.
Methods that were evaluated using different parameter budgets are represented with larger and smaller markers corresponding to more or fewer updated parameters.
Figure 3: Accuracy of different few-shot learning methods.
T-Few uses (IA)3 for parameter-efficient fine-tuning of T0, T0 uses zero-shot learning, and T5+LM and the GPT-3 variants use few-shot in-context learning.
The x-axis corresponds to inference costs; details are provided in section 4.2.
3.4 Pre-training (IA)3
In recent work, Gu et al. [18] and Vu et al. [19] showed that pre-training the prompt embeddings in prompt tuning can improve performance when fine-tuning on downstream few-shot tasks.
For pretraining, Gu et al [18] use a suite of self-supervised tasks applied to unlabeled text data, and Vu et al [19] consider using embeddings from a separate task or multitask mixture.
We follow Vu et al [19] and simply pre-train the new parameters introduced by (IA)3 on the same multitask mixture used to train T0.
We pre-train for 100,000 steps with a batch size of 16 before fine-tuning the (IA)3 parameters on each individual downstream dataset.
A full comparison of accuracy with and without pre-training (IA)3 is detailed in appendix E. We find that pre-training improves fine-tuned accuracy from 64.6 to 65.8 and therefore add it to our recipe.
As an objective, we use the sum of a standard language modeling loss $\mathcal{L}_{\text{LM}}$, an unlikelihood loss $\mathcal{L}_{\text{UL}}$ for incorrect choices, and a length-normalized loss $\mathcal{L}_{\text{LN}}$.
We train for 1,000 steps with a batch size of 8 sequences using the Adafactor optimizer [49] with a learning rate of 3e−3 and a linear decay schedule with a 60-step warmup.
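For reference, a hedged sketch of this optimization setup using the Hugging Face implementations of Adafactor and a linear warmup/decay schedule might look as follows; it assumes the (IA)3 vectors are the only parameters left with `requires_grad=True`:

```python
from transformers.optimization import Adafactor, get_linear_schedule_with_warmup

def make_t_few_optimizer(model, num_steps: int = 1000, warmup_steps: int = 60, lr: float = 3e-3):
    # Only the (IA)^3 vectors should be trainable; everything else is frozen upstream.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = Adafactor(trainable, lr=lr, scale_parameter=False, relative_step=False)
    scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, num_steps)
    return optimizer, scheduler
```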
4 Outperforming ICL with T-Few
Having designed and established the T-Few recipe on the T0-3B model, we now apply it to T0 (with eleven billion parameters) and compare performance to strong few-shot ICL methods.
T0. To measure the improvement in performance conferred through parameter-efficient few-shot learning, we compare to zero-shot evaluation using T0 itself.
In preliminary experiments, we found that T0 was not able to perform few-shot in-context learning – performance actually decreased as we increased the number of in-context examples.
T5+LM. Since T0 is unable to perform in-context learning on its own, we also compare to T5+LM, the next-step-prediction language model upon which T0 is based.
Specifically, we use the LM-adapted variant of T5.1.1.xxl released by Lester et al [14], which has the same architecture and number of parameters as T0.
Due to memory constraints and because of its improved performance, we use ensemble ICL for T5+LM [6].
Specifically, we perform one-shot ICL using each example in the training set individually and average the predictions for a given query example.
GPT-3. For a strong ICL baseline, we consider models in the GPT-3 family [4].
Specifically, we compare to the 6.7, 13, and 175 billion parameter variants of GPT-3.
Because these models have not been publicly released, we report numbers directly from Brown et al [4].
While GPT-3 is available through the commercial OpenAI API, re-running evaluation through the API would be more than an order of magnitude more expensive than running all of the experiments performed for this paper.
The accuracy on the held-out T0 datasets (described in section 3.1) is shown in table 1 and fig. 3, with per-dataset results reported in appendix F. We find that T-Few outperforms all other methods by a substantial margin.
Notably, T-Few achieves a 6% higher accuracy than few-shot ICL with GPT-3 175B despite being about 16× smaller and outperforms the smaller GPT-3 variants by an even larger margin.
4.2 Comparing computational costs
Having established that T-Few significantly outperforms ICL-based models, we now compare the relative costs of each few-shot learning approach.
Specifically, we estimate that a decoder-only Transformer (e g the GPT series) with N parameters uses 2N FLOPs per token for inference and 6N FLOPs per token for training.
Encoder-decoder models like T0 and T5 (where the encoder and decoder have the same number of layers and layer sizes) only process each token with either the encoder or decoder (each having roughly half the parameters of the full model), so the FLOPs per token estimates are halved to N and 3N FLOPs per token for inference and training.
We note that FLOPs are not a direct measurement of real-world computational cost because latency, power usage, and other costs can vary significantly depending on hardware and other factors [52].
However, we focus on FLOPs because it is a hardware-independent metric that correlates closely with real-world costs, and because the hardware setup used for running the different methods we consider would likely vary significantly across methods.
Processing a single input and all target choices with T-Few requires 11e9× 103 = 1.1e12 FLOPs, whereas few-shot ICL with GPT-3 175B requires 2× 175e9× (41 × 98 + 103) = 1.4e15 FLOPs – more than 3 orders of magnitude more.
As discussed in section 2.1, caching the key and value vectors when the same set of in-context examples is to be reused can reduce the computational cost of ICL.
Training an eleven billion parameter encoder-decoder model for 1,000 steps with a batch size of 8 length-103 sequences requires approximately 3 × 11e9 × 1, 000 × 8 × 103 = 2.7e16 FLOPs.
While not insignificant, this is only about 20 times larger than the FLOPs required to process a single example with few-shot in-context learning using GPT-3 175B.
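The following sketch reproduces these back-of-the-envelope estimates, using the 2N/6N FLOPs-per-token figures for decoder-only models and the halved N/3N figures for encoder-decoder models given above:

```python
def encdec_flops(params, tokens, training=False):        # T0/T5-style encoder-decoder
    return (3 if training else 1) * params * tokens

def decoder_only_flops(params, tokens, training=False):  # GPT-style decoder-only
    return (6 if training else 2) * params * tokens

t_few_inference = encdec_flops(11e9, 103)                           # ~1.1e12 FLOPs
gpt3_icl        = decoder_only_flops(175e9, 41 * 98 + 103)          # ~1.4e15 FLOPs
t_few_training  = encdec_flops(11e9, 1000 * 8 * 103, training=True) # ~2.7e16 FLOPs
print(f"{t_few_inference:.1e} {gpt3_icl:.1e} {t_few_training:.1e}")
```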
Storage cost. When stored as single-precision floats, the parameters added by (IA)3 take up 4.2 MB of space on disk.
In contrast, ICL methods only require storing the tokenized in-context examples (typically stored as 32-bit integers), resulting in a smaller 41 × 98 × 32 bits = 16 kB disk space requirement.
However, we note that 4.2 MB is dwarfed by the on-disk size of the model checkpoints themselves – storing the (IA)3 adaptation vectors for 10,000 tasks would take about as much space as the T0 checkpoint (41.5 GB).
Additional memory costs are incurred when training T-Few due to the need to cache intermediate activations for backpropagation and for the gradient accumulator variables in Adafactor.
However, as mentioned above, it is possible to use the T-Few recipe on a single 80GB A100 GPU.
4.3 Performance on Real-world Few-shot Tasks (RAFT)
So far, we have evaluated performance on a collection of datasets that were not explicitly designed for benchmarking few-shot learning.
RAFT consists of 11 “economically valuable” tasks that aim to mirror real-world applications.
Importantly, each RAFT dataset has only 50 training examples with no validation set and a (larger) test set with no public labels, so it is impossible to "cheat" by tuning on an unrealistically-large validation set or by peeking at the test set [31, 30].
We apply T-Few to RAFT by using the standard prompts released alongside the dataset.
The accuracy of the current top-5 methods is shown in table 2, with further details provided in appendix H. T-Few attains a state-of-the-art accuracy of 75.8% and outperforms the human baseline (73.5% accuracy) for the first time.
We experiment with omitting the step of pre-training (IA)3 and removing unlikelihood training and length normalization. Detailed results are shown in appendix G. We confirm that each of the ingredients provides a boost in accuracy: Removing pre-training decreases accuracy by 1.6%, and removing both pre-training and our additional loss terms reduces accuracy by an additional 2.5%.
5 Related Work
Liu et al. [54] introduce several tricks to improve prompt tuning, An et al. [55] tune prompts along with input embeddings for a boost in performance, and Chen et al. [56] improve prompt embeddings through continued pre-training.
Given optimization difficulties when training prompt embeddings, Diao et al [57] recently used black-box optimization to train prompt embeddings without requiring gradients.
Several works have analyzed prompt tuning from the perspective of interpretability (Khashabi et al. [58]) and its similarity to other PEFT methods (He et al. [29]).
Prompt tuning has been applied to various applications for NLP including continual learning [59], model robustness [60, 61], summarization [62], machine translation [63], co-training [64], probing language models [65, 65], inverse prompting [66], and transfer learning [67].
He et al [68] recently proposed the use of a hypernetwork to predict prompts for new tasks (rather than training the prompt parameters with gradient descent).
Prompt tuning and other PEFT methods have also been explored outside of the context of language models (e.g., vision [22, 69] and vision-and-language models [26]).
Recent work has analyzed training with discrete prompts, demonstrating a boost in performance with prompting when training on various numbers of examples [71], finding that models perform similarly when trained on good and bad prompts [11], and exploring which prompts work well for few-shot and full-shot setting [72].
There have also been efforts to develop methods that find performant discrete prompts [73, 74] and training prompts using methods similar to prompt tuning [75].
There has also been a great deal of work on improving ICL.
Chen et al [5], Min et al [6] use ICL for meta-learning to perform few-shot learning on new tasks.
Lampinen et al [7] show ICL can improve when explanations are provided and [8] use ICL with text retrieved from the web for open-domain question-answering.
Meanwhile, Min et al [9] analyze how ICL works and show that ICL can still perform well when incorrect labels are provided for the in-context examples.
With the advent of large language models with billions of parameters, there has been a great deal of recent interest in PEFT methods.
In concurrent work, Mahabadi et al [76] compare PEFT to the use of discrete prompts (e g PET [70]) during few-shot fine-tuning and find that PEFT compares favorably.
Also concurrently, Moosavi et al [77] propose a framework for introducing adapters whose architecture and design vary from task to task and demonstrate improved results in few-shot settings.
Gu et al [18] and Vu et al [19] both explored how pre-training prompt tuning parameters can improve when limited labeled data is available.
6 Conclusion
We introduced T-Few, a parameter-efficient few-shot learning recipe that attains higher accuracy than few-shot in-context learning at a lower computational cost.
T-Few also uses two additional loss terms that encourage the model to output lower probabilities for incorrect choices and account for the length of different answer choices.
When applying T-Few as-is (with no task-specific hyperparameter tuning or other changes) to the RAFT benchmark, we attained super-human performance for the first time and outperformed prior submissions by a large margin.
Through detailed characterization of computational costs, we found that T-Few uses over 1,000× fewer FLOPs during inference than few-shot in-context learning with GPT-3 and only requires 30 minutes to train on a single NVIDIA A100 GPU.
Acknowledgments and Disclosure of Funding
We thank Brian Lester and Noah Constant for helpful discussion on debugging prompt tuning and Rabeeh Karimi Mahabadi for help with Compacter and Intrinsic SAID.
We also thank Stella Biderman and the Google TPU Research Cloud who provided valuable computational resources to support this work.
This work was supported by NSF-AI Engage Institute DRL-2112635.
References
[1] Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.
[2] Neel Alex, Eli Lifland, Lewis Tunstall, Abhishek Thakur, Pegah Maham, C Jess Riedel, Emmie Hine, Carolyn Ashurst, Paul Sedille, Alexis Carlier, et al. RAFT: A real-world few-shot text classification benchmark. arXiv preprint arXiv:2109.14076, 2021.
[3] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 2019.
[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[5] Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He.
[6] Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. arXiv preprint arXiv:2110.15943, 2021.
[7] Andrew Kyle Lampinen, Ishita Dasgupta, Stephanie C. Y. Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L. McClelland, Jane X. Wang, and Felix Hill. Can language models learn from explanations in context? arXiv preprint arXiv:2204.02329, 2022.
[8] Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115, 2022.
[9] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022.
[10] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705, 2022.
[11] Albert Webson and Ellie Pavlick. Do prompt-based models really understand the meaning of their prompts? arXiv preprint arXiv:2109.01247, 2021.
[12] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690, 2021.
[13] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[14] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
[15] Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2020.
[16] Derek Tam, Rakesh R Menon, Mohit Bansal, Shashank Srivastava, and Colin Raffel. Improving and simplifying pattern exploiting training. arXiv preprint arXiv:2103.11955, 2021.
[17] Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019.
[18] Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang.
[29] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021.
[30] Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models. arXiv preprint arXiv:2105.11447, 2021.
[31] Avital Oliver, Augustus Odena, Colin Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. Advances in Neural Information Processing Systems, 2018.
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
[33] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[34] Stephen H. Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Févry, et al. PromptSource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279, 2022.
[35] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing.
Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012.
[43] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[44] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. arXiv preprint arXiv:1808.09121, 2018.
[45] Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
[46] Stefan Van Der Walt, S. Chris Colbert, and Gael Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2), 2011.
[47] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
[48] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
[49] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning. PMLR, 2018.
[50] Timo Schick and Hinrich Schütze. True few-shot learning with prompts – a real-world perspective. arXiv preprint arXiv:2111.13440, 2021.
[51] Moshe Wasserblat. Sentence transformer fine-tuning (SetFit): Outperforming GPT-3 on few-shot text-classification while being 1600 times smaller, 2021.
[52] Mostafa Dehghani, Anurag Arnab, Lucas Beyer, Ashish Vaswani, and Yi Tay.
[58] Daniel Khashabi, Shane Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sameer Singh, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, et al. Prompt waywardness: The curious case of discretized interpretation of continuous prompts. arXiv preprint arXiv:2112.08348, 2021.
[59] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. arXiv preprint arXiv:2112.08654, 2021.
[60] Zonghan Yang and Yang Liu. On robust prefix-tuning for text classification. arXiv preprint arXiv:2203.10378, 2022.
[61] Yuting Yang, Pei Huang, Juan Cao, Jintao Li, Yun Lin, Jin Song Dong, Feifei Ma, and Jian Zhang. A prompting-based approach for adversarial example generation and robustness enhancement. arXiv preprint arXiv:2203.10714, 2022.
[62] Xiaochen Liu, Yu Bai, Jiawei Li, Yinan Hu, and Yang Gao. PSP: Pre-trained soft prompts for few-shot abstractive summarization. arXiv preprint arXiv:2204.04413, 2022.
[63] Xavier Garcia and Orhan Firat. Using natural language prompts for machine translation. arXiv preprint arXiv:2202.11822, 2022.
[64] Hunter Lang, Monica Agrawal, Yoon Kim, and David Sontag. Co-training improves prompt-based learning for large language models. arXiv preprint arXiv:2202.00828, 2022.
[65] Boshi Wang, Xiang Deng, and Huan Sun. Shepherd pre-trained language models to develop a train of thought: An iterative prompting approach. arXiv preprint arXiv:2203.08383, 2022.
[66] Xu Zou, Da Yin, Qingyang Zhong, Hongxia Yang, Zhilin Yang, and Jie Tang. Controllable generation from pre-trained language models via inverse prompting. arXiv preprint arXiv:2103.10685, 2021.
[67] Yusheng Su, Xiaozhi Wang, Yujia Qin, Chi-Min Chan, Yankai Lin, Zhiyuan Liu, Peng Li, Juanzi Li, Lei Hou, Maosong Sun, et al. On transferability of prompt tuning for natural language understanding. arXiv preprint arXiv:2111.06719, 2021.
[68] Yun He, Huaixiu Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, YaGuang Li, Zhao Chen, Donald Metzler, et al. HyperPrompt: Prompt-based task-conditioning of transformers.
A Compute Description
All T0-3B models were trained on 48GB A6000s.
Training T0-3B with a given PEFT method took about an hour, except for Intrinsic SAID and FISH Mask, which each took about two hours.
Table 3: Per-dataset results for comparing the effect of including the additional loss terms introduced in section 3.2.
FT: 75.8 (5.4), 82.1 (5.4), 47.8 (1.5), 40.6 (0.8), 37.8 (1.8)
+ UL: 77.6 (1.4), 89.3 (1.8), 47.9 (1.9), 40.9 (1.9), 38.8 (5.0)
+ LN: 75.8 (4.3), 89.3 (7.1), 48.2 (0.6), 40.9 (0.9), 38.3 (1.6)
+ UL + LN: 79.8 (3.6), 87.5 (5.4), 46.6 (2.5), 41.3 (0.9), 40.2 (5.3)
D Full PEFT Results
We compare against the following PEFT methods, using a linear decay schedule with a warm-up ratio of 0.06 and the Adafactor optimizer [49].
Table 4 shows the full per-dataset results of all PEFT methods we considered.
Full Model Fine-tuning We train for 300 steps with a learning rate of 3e−4.
Then, these parameters are trained for 1500 steps with a learning rate of 3e−4.
Intrinsic SAID [27] We train for 3K steps with a learning rate of 3e−2.
LoRA [13] We use a rank of 4 with an initialization scale of 0.01 and update all the attention and
Since Banking 77 has 77 classes which causes memory issues for unlikelihood training, we turn off unlikelihood training for Banking 77.
We also feed in all the labels as part of the input string for Banking 77 since there were some labels never seen during training and clean the labels by replacing "." with ",".