Fugu-MT paper translation (abstract): EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs

Paper abstract: EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs

arxiv url: http://arxiv.org/abs/2509.16686v1
Date: Sat, 20 Sep 2025 13:27:13 GMT
Status: Translation complete
System update date: 2025-09-23 18:58:15.927368
Title: EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs
Title (reference translation): EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs
Authors: Zhengge Cai, Haowen Hou
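Abstract summary: Reducing the key-value (KV) cache size is a crucial step toward efficient inference in large language models (LLMs). Recent work on Multi-head Latent Attention (MLA) reduces the cache by compressing KV representations into a shared latent space. The paper proposes Embedding-Gated Multi-head Latent Attention (EG-MLA), a novel extension of MLA.
Reference score (independently calculated attention score): 8.093922145280326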
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reducing the key-value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs), especially under latency and memory constraints. While Multi-Head Attention (MHA) offers strong representational power, it incurs significant memory overhead. Recent work on Multi-head Latent Attention (MLA) mitigates this by compressing KV representations into a shared latent space, achieving a better trade-off between performance and cache efficiency. While MLA already achieves significant KV cache reduction, the scope for further compression remains limited without performance loss. In this paper, we propose Embedding-Gated Multi-head Latent Attention (EG-MLA), a novel extension of MLA that further reduces KV cache size while enhancing representational expressiveness. EG-MLA introduces a token-specific embedding gating mechanism applied in the latent space, enabling fine-grained modulation of compressed KV vectors with minimal additional computation. Compared to MHA, EG-MLA achieves over 91.6% reduction in KV cache size with negligible performance degradation. Relative to MLA, EG-MLA consistently improves task accuracy across diverse reasoning benchmarks while achieving up to 59.9% additional memory savings. Our theoretical analysis highlights how embedding gating induces implicit high-order interactions, and empirical evaluations demonstrate robust generalization across model scales and compression regimes. Notably, we successfully scale EG-MLA to over 1 billion parameters, demonstrating its practical viability for large-scale LLM deployment. These results establish EG-MLA as a memory- and compute-efficient attention mechanism that enables scalable, high-performance inference in modern LLMs.
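Abstract (reference translation): Reducing the key-value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs), especially under latency and memory constraints. Multi-Head Attention (MHA) offers strong representational power but incurs significant memory overhead. Recent work on Multi-head Latent Attention (MLA) mitigates this by compressing KV representations into a shared latent space, improving the trade-off between performance and cache efficiency. Although MLA already achieves a significant KV cache reduction, there is limited room to compress further without losing performance. This paper proposes Embedding-Gated Multi-head Latent Attention (EG-MLA), a novel extension of MLA that further reduces KV cache size while enhancing representational expressiveness. EG-MLA introduces a token-specific embedding gating mechanism applied in the latent space, enabling fine-grained modulation of the compressed KV vectors with minimal additional computation. Compared to MHA, EG-MLA reduces KV cache size by over 91.6% with negligible performance degradation. Relative to MLA, EG-MLA consistently improves task accuracy across diverse reasoning benchmarks while saving up to 59.9% additional memory. The theoretical analysis shows that embedding gating implicitly induces high-order interactions, and empirical evaluations show robust generalization across model scales and compression regimes. Notably, EG-MLA is scaled successfully to over 1 billion parameters, demonstrating its practical viability for large-scale LLM deployment. These results establish EG-MLA as a memory- and compute-efficient attention mechanism that enables scalable, high-performance inference in modern LLMs.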
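
Illustrative sketch (not taken from the paper): the abstract describes a token-specific gate applied to compressed KV vectors in the latent space but gives no equations. The toy Python below shows one plausible reading, assuming the gate is a per-token vector looked up by token id and multiplied elementwise with the down-projected latent. All names (`W_down`, `W_up_k`, `gate_table`, `gated_latent`, `expand`) and the choice of where the gate is applied are assumptions made for illustration, not the authors' formulation.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gated_latent(token_id, hidden, W_down, gate_table):
    """Compress a hidden state and modulate it with a per-token gate.

    Assumption: the gate is an embedding row selected by token id and
    applied elementwise in the latent space.
    """
    latent = matvec(W_down, hidden)   # d_model -> d_latent
    gate = gate_table[token_id]       # token-specific gate vector
    return [c * g for c, g in zip(latent, gate)]


def expand(latent, W_up):
    """Up-project a latent vector to a per-head key (or value)."""
    return matvec(W_up, latent)


# Tiny hypothetical example: d_model = 4, d_latent = 2, d_head = 3.
W_down = [[1, 0, 0, 0],
          [0, 1, 0, 0]]
W_up_k = [[1, 1],
          [1, -1],
          [0, 1]]
gate_table = [[1.0, 1.0],   # token 0: gate of ones leaves the latent unchanged
              [0.5, 2.0]]   # token 1: rescales each latent dimension
hidden = [2, 3, 5, 7]

cached = gated_latent(1, hidden, W_down, gate_table)
key = expand(cached, W_up_k)
print(cached)  # [1.0, 6.0]
print(key)     # [7.0, -5.0, 6.0]
```

In this sketch only the `d_latent` values in `cached` would be stored per token, while full keys are rebuilt on demand by `expand`. That storage pattern is the general idea behind latent KV compression; whether EG-MLA caches the gated or ungated latent is not stated in the abstract.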


Related papers