Fugu-MT paper translation (abstract): Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space

Paper summary: Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space

arxiv url: http://arxiv.org/abs/2510.04476v1
Date: Mon, 06 Oct 2025 04:24:23 GMT
Status: translation complete
Last updated in system: 2025-10-07 16:52:59.676907
Title: Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space
Authors: Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge
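Abstract summary: Multi-headed Attention's (MHA) quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve. This paper proposes Compressed Convolutional Attention (CCA), a novel attention method that down-projects queries, keys, and values and performs the entire attention operation inside the shared latent space. Experiments show that CCGQA consistently outperforms both Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) at equal KV-cache compression on dense and MoE models.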
Reference score (independently computed attention metric): 12.98205656003145
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-headed Attention's (MHA) quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) shrink the cache, speeding decode, but leave compute, which determines prefill and training speed, largely unchanged. We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values and performs the entire attention operation inside the shared latent space. This simple design dramatically cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Because CCA is orthogonal to head-sharing, we combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which further tightens the compute-bandwidth Pareto frontier so that users can tune compression toward either FLOP or memory limits without sacrificing quality. Experiments show that CCGQA consistently outperforms both GQA and MLA at equal KV-cache compression on dense and MoE models. Additionally, we show that CCGQA outperforms all other attention methods on MoE models with half the KV-cache of GQA and MLA, achieving an 8x KV-cache compression with no drop in performance compared to standard MHA. CCA and CCGQA also dramatically reduce the FLOP cost of attention which leads to substantially faster training and prefill than existing methods. On H100 GPUs, our fused CCA/CCGQA kernel reduces prefill latency by about 1.7x at a sequence length of 16k relative to MHA, and accelerates backward by about 1.3x.
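The claim that running attention inside a compressed latent space cuts parameters, KV-cache, and FLOPs together by the compression factor can be checked with rough arithmetic. The sketch below is not the authors' accounting: the function `attention_costs`, the model width of 4096, and the simplified cost formulas are all assumptions made for illustration, and the convolutional components of CCA and any head sharing are ignored. It only counts what happens when every attention tensor has width `d_model / compression` instead of `d_model`.

```python
def attention_costs(d_model, seq_len, compression=1):
    """Rough per-layer cost counts when attention runs at width d_model / compression.

    Hypothetical simplified model: ignores convolutions, head sharing,
    softmax cost, and causal masking.
    """
    if d_model % compression != 0:
        raise ValueError("compression must divide d_model")
    d_latent = d_model // compression
    return {
        "latent_width": d_latent,
        # Q, K, V projections into the latent space plus one output projection back out.
        "projection_params": 4 * d_model * d_latent,
        # One K and one V vector of width d_latent cached per token.
        "kv_cache_elems": 2 * seq_len * d_latent,
        # QK^T and the weighted sum over V each cost about 2 * n^2 * width FLOPs.
        "attention_flops": 4 * seq_len * seq_len * d_latent,
    }


# Assumed example sizes: width 4096, 16k tokens, 8x compression.
mha = attention_costs(d_model=4096, seq_len=16384)
compressed = attention_costs(d_model=4096, seq_len=16384, compression=8)

# Ratio of uncompressed to compressed cost for each quantity.
ratios = {name: mha[name] / compressed[name] for name in mha}
print(ratios)
```

Under these assumptions every ratio equals the compression factor, which is the "all at once" behaviour the abstract describes. Methods that shrink only the cached K and V would reduce `kv_cache_elems` in this model while leaving `attention_flops` untouched.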


Related papers