Fugu-MT Paper Translation (Abstract): FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

Paper abstract: FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

arxiv url: http://arxiv.org/abs/2603.14591v1
Date: Sun, 15 Mar 2026 20:26:42 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.905239
Title: FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference
Title (reference translation): FlashHead: An Efficient Drop-In Replacement for the Classification Head in Language Model Inference
Authors: Wilhelm Tranheden, Shahnawaz Ahmed, Devdatt Dubhashi, Jonna Matthiesen, Hannes von Essen,
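Abstract summary: We introduce FlashHead, a training-free, hardware-friendly replacement for the dense classification head. Building on principles from information retrieval, FlashHead reframes the computation at the output head as a retrieval problem. We show that FlashHead delivers model-level inference speedups of up to 1.75x while maintaining output accuracy compared to the original head.
Reference score (independently computed attention metric): 0.0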
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language models are increasingly adopting smaller architectures optimized for consumer devices. In this setting, inference efficiency is the primary constraint. Meanwhile, vocabulary sizes continue to grow rapidly, making the classification head a critical bottleneck that accounts for up to 60% of model parameters and 50% of inference compute. We introduce FlashHead, the first efficient drop-in replacement for the dense classification head that is training-free and hardware-friendly. FlashHead builds on principles from information retrieval, reframing the computation at the output head as a retrieval problem rather than a dense classification over the full vocabulary. FlashHead introduces four key innovations: (1) a balanced clustering scheme that structures vocabulary partitions into compact hardware-efficient tensors, (2) an extension of multiprobe retrieval to language model heads, enabling thousands of clusters to be scored in parallel, (3) a novel inference-time sampling mechanism that extends retrieval beyond top tokens, enabling probabilistic sampling across the full vocabulary, and (4) selective quantization, enabling effective low-bit computation in the head. Experiments on Llama-3.2, Gemma-3, and Qwen-3 show that FlashHead delivers model-level inference speedups of up to 1.75x while maintaining output accuracy compared to the original head. By overcoming the classification head bottleneck, FlashHead establishes a new benchmark for efficient inference and removes a key barrier to developing smaller, capable models for consumer hardware.
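Abstract (reference translation): Language models are increasingly adopting smaller architectures optimized for consumer devices. In this setting, inference efficiency is the primary constraint. Meanwhile, vocabulary sizes continue to grow rapidly, and the classification head has become a bottleneck, accounting for up to 60% of model parameters and 50% of inference compute. We introduce FlashHead, the first efficient drop-in replacement for the dense classification head that is training-free and hardware-friendly. FlashHead builds on principles from information retrieval and treats the computation at the output head as a retrieval problem rather than a dense classification over the full vocabulary. FlashHead introduces four key innovations: (1) a balanced clustering scheme that structures vocabulary partitions into compact hardware-efficient tensors, (2) an extension of multiprobe retrieval to language model heads, allowing thousands of clusters to be scored in parallel, (3) a new inference-time sampling mechanism that extends retrieval beyond top tokens and enables probabilistic sampling across the full vocabulary, and (4) selective quantization that enables effective low-bit computation in the head. Experiments on Llama-3.2, Gemma-3, and Qwen-3 show that FlashHead delivers model-level inference speedups of up to 1.75x while maintaining output accuracy compared to the original head. By overcoming the classification head bottleneck, FlashHead establishes a new benchmark for efficient inference and removes a key barrier to developing smaller, capable models for consumer hardware.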
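
The idea of treating the output head as a retrieval problem can be illustrated with a toy two-stage lookup: score a small set of cluster centroids first, then compute exact logits only for the tokens in the best-scoring clusters. The sketch below is not the authors' implementation. The function names (`build_balanced_clusters`, `retrieval_head`), the random-projection partitioning used as a stand-in for balanced clustering, and all sizes are assumptions made for illustration. It does not cover the sampling mechanism or the selective quantization described in the abstract.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_balanced_clusters(weights, num_clusters, seed=0):
    """Split vocabulary rows into equal-size clusters (toy stand-in).

    Assumption: tokens are ordered by a random 1-D projection and cut into
    equal chunks. This only guarantees balance, not cluster quality.
    """
    if len(weights) % num_clusters:
        raise ValueError("vocabulary size must be divisible by num_clusters")
    rng = random.Random(seed)
    dim = len(weights[0])
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    order = sorted(range(len(weights)), key=lambda t: dot(weights[t], direction))
    size = len(weights) // num_clusters
    clusters = [order[i * size:(i + 1) * size] for i in range(num_clusters)]
    # Each centroid is the mean of the head rows assigned to that cluster.
    centroids = [
        [sum(weights[t][d] for t in cluster) / size for d in range(dim)]
        for cluster in clusters
    ]
    return clusters, centroids


def retrieval_head(hidden, weights, clusters, centroids, num_probes):
    """Return exact logits for tokens in the num_probes best-scoring clusters."""
    # Stage 1: score every centroid against the hidden state.
    scores = [dot(hidden, c) for c in centroids]
    probed = sorted(range(len(centroids)), key=scores.__getitem__, reverse=True)
    probed = probed[:num_probes]
    # Stage 2: exact logits only for the tokens inside the probed clusters.
    return {
        tok: dot(hidden, weights[tok])
        for cid in probed
        for tok in clusters[cid]
    }


if __name__ == "__main__":
    random.seed(0)
    vocab, dim, num_clusters = 64, 8, 8  # hypothetical toy sizes
    W = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]
    h = [random.gauss(0, 1) for _ in range(dim)]
    clusters, centroids = build_balanced_clusters(W, num_clusters)
    logits = retrieval_head(h, W, clusters, centroids, num_probes=2)
    print(len(logits), "of", vocab, "logits computed")
```

With `num_probes` equal to the number of clusters, the lookup reduces to the dense head. Smaller values trade exactness for fewer dot products, since a token outside the probed clusters is never scored.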

Related papers