Fugu-MT Paper Translation (Abstract): Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct

Paper overview: Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct

arxiv url: http://arxiv.org/abs/2601.14277v1
Date: Sun, 11 Jan 2026 18:52:37 GMT
Status: Translation complete
Last updated in system: 2026-01-25 16:54:51.826745
Title: Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct
Authors: Uygar Kurt
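Abstract summary: Quantization is a technique that makes large language models easier to deploy by reducing the precision used to store and operate on model weights. The paper presents a unified empirical study of llama.cpp quantization on a single modern model, Llama-3.1-8B-Instruct (FP16, GGUF).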
Reference score (independently computed attention metric): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Quantization is a practical technique for making large language models easier to deploy by reducing the precision used to store and operate on model weights. This can lower memory use and improve runtime feasibility on constrained hardware, which is especially relevant for users running models locally. Quantization in llama.cpp enables large language models to run on commodity hardware, but available formats are often evaluated inconsistently, making it hard to choose among schemes. We present a unified empirical study of the llama.cpp quantization on a single modern model, Llama-3.1-8B-Instruct (FP16, GGUF), covering 3-8 bit K-quant and legacy formats. We evaluate downstream task performance across standard reasoning, knowledge, instruction-following, and truthfulness benchmarks, and also measure perplexity and CPU throughput (prefill/decoding) alongside model size, compression, and quantization time. Ultimately, this work is a practical guide for choosing a llama.cpp quantization scheme, helping readers make informed, context-aware decisions for their intended use and resource budget.
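To make the abstract's idea of "reducing the precision used to store model weights" concrete, here is a minimal Python sketch of symmetric block-wise quantization and the resulting bits-per-weight arithmetic. It is an illustration only, not the paper's method or llama.cpp's implementation: the function names, the block size of 32, the 16-bit per-block scale, and the random test weights are all assumptions chosen for the example.

```python
import random


def quantize_block(weights, bits=8):
    """Symmetric quantization of one block: a single float scale plus signed integers.

    Hypothetical helper for illustration; not a llama.cpp API.
    """
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0        # avoid division by zero
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate float weights from the scale and integer codes."""
    return [scale * v for v in q]


def bits_per_weight(block_size, bits, scale_bits=16):
    """Effective storage cost: integer codes plus the per-block scale overhead."""
    return bits + scale_bits / block_size


random.seed(0)
# Assumed toy data: 32 small Gaussian "weights" standing in for one block.
block = [random.gauss(0.0, 0.02) for _ in range(32)]

scale, q = quantize_block(block, bits=8)
restored = dequantize_block(scale, q)

# Rounding error per weight is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(block, restored))

bpw = bits_per_weight(block_size=32, bits=8)        # 8 + 16/32 = 8.5
compression = 16 / bpw                              # relative to FP16 storage

print(f"max abs error: {max_err:.6f} (step size {scale:.6f})")
print(f"bits per weight: {bpw}, compression vs FP16: {compression:.2f}x")
```

Lowering `bits` in this sketch shrinks storage but widens the quantization step, which is the size versus accuracy trade-off the study measures across formats.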

