Fugu-MT Paper Translation (Abstract): QLoRA: Efficient Finetuning of Quantized LLMs

Paper summary: QLoRA: Efficient Finetuning of Quantized LLMs

arxiv url: http://arxiv.org/abs/2305.14314v1
Date: Tue, 23 May 2023 17:50:33 GMT
Status: Translation complete
Last updated in system: 2023-05-24 13:47:35.033217
Title: QLoRA: Efficient Finetuning of Quantized LLMs
Title (reference translation): QLoRA: Efficient Finetuning of Quantized LLMs
Authors: Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
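Abstract summary: We propose QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). The best model family, Guanaco, outperforms all previous openly released models on the Vicuna benchmark.
Reference score (attention score computed independently by the site): 66.58009990713134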
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.
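Abstract (reference translation): QLoRA is an efficient finetuning method that cuts memory usage far enough to finetune a 65B parameter model on one 48GB GPU while retaining the task performance of full 16-bit finetuning. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). The best model family, named Guanaco, outperforms all previous openly released models on the Vicuna benchmark and reaches 99.3% of the performance level of ChatGPT with only 24 hours of finetuning on a single GPU. QLoRA introduces several innovations that save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, (b) double quantization, which reduces the average memory footprint by quantizing the quantization constants, and (c) paged optimizers, which manage memory spikes. The authors use QLoRA to finetune more than 1,000 models and analyze instruction following and chatbot performance in detail across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that could not be run with regular finetuning (e.g. 33B and 65B parameter models). The results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even with smaller models than the previous SoTA. A detailed analysis of chatbot performance based on both human and GPT-4 evaluations shows that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. In addition, current chatbot benchmarks are found not to be trustworthy for accurately evaluating the performance levels of chatbots. A lemon-picked analysis shows where Guanaco fails compared to ChatGPT. All models and code are released, including CUDA kernels for 4-bit training.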
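
The abstract states that frozen weights are stored in a 4-bit data type built for normally distributed values. The following is a minimal Python sketch of that general idea, not the paper's NF4 definition and not the released CUDA kernels. It assumes a 16-level codebook taken from evenly spaced quantiles of a standard normal distribution and a per-block absolute-maximum scale; the block layout and all function names (normal_quantile_codebook, quantize_block, dequantize_block, weight_memory_gb) are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def normal_quantile_codebook(levels: int = 16) -> list[float]:
    """Build a codebook from evenly spaced N(0, 1) quantiles, rescaled to [-1, 1].

    Assumption: a simplified stand-in for a normal-aware 4-bit data type,
    not the exact NF4 construction from the paper.
    """
    nd = NormalDist()
    # Midpoint probabilities avoid the infinite quantiles at p = 0 and p = 1.
    raw = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]


def quantize_block(block: list[float], codebook: list[float]) -> tuple[list[int], float]:
    """Map each weight in a block to the index of its nearest codebook entry."""
    # The per-block scale (quantization constant) is the absolute maximum.
    absmax = max(abs(w) for w in block) or 1.0
    codes = [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - w / absmax))
        for w in block
    ]
    return codes, absmax


def dequantize_block(codes: list[int], absmax: float, codebook: list[float]) -> list[float]:
    """Recover approximate weights from 4-bit codes and the block scale."""
    return [codebook[c] * absmax for c in codes]


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    codebook = normal_quantile_codebook()
    weights = [0.5, -2.0, 0.1, 1.0]
    codes, absmax = quantize_block(weights, codebook)
    print(codes, dequantize_block(codes, absmax, codebook))
    print(weight_memory_gb(65e9, 16), weight_memory_gb(65e9, 4))
```

Under these assumptions, the weights of a 65B parameter model take 130 GB at 16 bits and 32.5 GB at 4 bits. The estimate covers weights only; the per-block scales, the LoRA adapters, activations, and optimizer state add to it, and the per-block scales are the constants that double quantization compresses further.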

Related papers