Fugu-MT Paper Translation (Abstract): LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

Paper overview: LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

arxiv url: http://arxiv.org/abs/2604.22050v1
Date: Thu, 23 Apr 2026 20:12:19 GMT
Status: Translation complete
Last updated in system: 2026-04-27 15:36:26.259723
Title: LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs
Title (reference translation): LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs
Authors: Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric
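Abstract summary: LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency. It matches base model performance on several benchmarks, shows only minor degradation on others, and significantly outperforms state-of-the-art attention linearization methods.
Reference score (site-computed attention level): 3.80555579179805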
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Transformers rely mostly on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all layers, often leading to significant performance degradation or requiring extensive retraining to recover model quality. This work proposes LayerBoost, a layer-aware attention reduction method that selectively modifies the attention mechanism based on the sensitivity of individual transformer layers. It first performs a systematic sensitivity analysis on a pretrained model to identify layers that are critical for maintaining performance. Guided by this analysis, three distinct strategies can be applied: retaining standard softmax attention in highly sensitive layers, replacing it with linear sliding window attention in moderately sensitive layers, and removing attention entirely in layers that exhibit low sensitivity. To recover performance after these architectural modifications, we introduce a lightweight distillation-based healing phase requiring only 10M additional training tokens. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency, while maintaining competitive model quality. It matches base model performance on several benchmarks, exhibits only minor degradations on others, and significantly outperforms state-of-the-art attention linearization methods. These efficiency gains make our method particularly well-suited for high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks.
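Abstract (reference translation): Transformers rely mainly on softmax attention, which introduces quadratic complexity with respect to sequence length and is a bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all layers, which often causes significant performance degradation or requires extensive retraining to recover model quality. This work proposes LayerBoost, a layer-aware attention reduction method that selectively modifies the attention mechanism according to the sensitivity of individual transformer layers. It first runs a systematic sensitivity analysis on a pretrained model to identify the layers that are critical for maintaining performance. Guided by this analysis, one of three strategies is applied to each layer: standard softmax attention is kept in highly sensitive layers, it is replaced with linear sliding window attention in moderately sensitive layers, and attention is removed entirely in low-sensitivity layers. To recover performance after these architectural changes, a lightweight distillation-based healing phase is introduced that requires only 10M additional training tokens. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency while maintaining competitive model quality. It matches base model performance on several benchmarks, shows only minor degradation on others, and significantly outperforms state-of-the-art attention linearization methods. These efficiency gains make the method particularly well suited to high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks.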
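
The three-way layer policy described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the sensitivity scores, the thresholds `low` and `high`, and the names `Strategy`, `assign_strategies`, and `attended_pairs` are all hypothetical, and "linear sliding window attention" is approximated here as a plain causal sliding window, which may differ from the paper's mechanism. The second function only counts query-key pairs as a rough proxy for attention cost.

```python
from enum import Enum


class Strategy(Enum):
    SOFTMAX = "softmax"        # keep full softmax attention
    SLIDING_WINDOW = "window"  # sliding window attention (assumed causal)
    NONE = "none"              # attention removed from the layer


def assign_strategies(sensitivity, low=0.1, high=0.5):
    """Map each layer's sensitivity score to one of three strategies.

    `low` and `high` are hypothetical thresholds, not values from the paper.
    """
    plan = []
    for score in sensitivity:
        if score >= high:
            plan.append(Strategy.SOFTMAX)
        elif score >= low:
            plan.append(Strategy.SLIDING_WINDOW)
        else:
            plan.append(Strategy.NONE)
    return plan


def attended_pairs(plan, seq_len, window):
    """Count query-key pairs scored per head, assuming causal masking."""
    # Full causal attention: query i attends to keys 0..i.
    full = seq_len * (seq_len + 1) // 2
    # Windowed attention: query i attends to at most `window` recent keys.
    windowed = sum(min(i + 1, window) for i in range(seq_len))
    cost = {
        Strategy.SOFTMAX: full,
        Strategy.SLIDING_WINDOW: windowed,
        Strategy.NONE: 0,
    }
    return sum(cost[s] for s in plan)


if __name__ == "__main__":
    scores = [0.9, 0.3, 0.05, 0.6]  # made-up per-layer sensitivity scores
    plan = assign_strategies(scores)
    print([s.value for s in plan])
    print(attended_pairs(plan, seq_len=8, window=4))
    print(attended_pairs([Strategy.SOFTMAX] * len(scores), seq_len=8, window=4))
```

With these made-up inputs the mixed plan scores 98 query-key pairs against 144 for an all-softmax model of the same depth. The gap widens with sequence length because the windowed term grows linearly while the full term grows quadratically.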


Related papers