Fugu-MT Paper Translation (Summary): D-com: Accelerating Iterative Processing to Enable Low-rank Decomposition of Activations

Paper summary: D-com: Accelerating Iterative Processing to Enable Low-rank Decomposition of Activations

arxiv url: http://arxiv.org/abs/2510.13147v1
Date: Wed, 15 Oct 2025 04:56:36 GMT
Status: Translation complete
Last updated in system: 2025-10-16 20:13:28.49938
Title: D-com: Accelerating Iterative Processing to Enable Low-rank Decomposition of Activations
Title (reference translation): D-com: Accelerating iterative processing to enable low-rank decomposition of activations
Authors: Faraz Tahmasebi, Michael Pelluer, Hyoukjun Kwon
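Abstract summary: This paper shows that input decomposition can be significantly beneficial with a proper choice of decomposition algorithm and hardware support. The authors adopt a progressive decomposition algorithm, the Lanczos algorithm, and design a co-accelerator architecture for it. Their accelerator, D-com, improves end-to-end latency by 22% compared to an A100 GPU at the cost of small model quality degradation.
Reference score (independently computed attention score): 2.4698886064068555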
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The computation and memory costs of large language models have kept increasing over the last decade, reaching the scale of over 1T parameters. To address the challenges of such large-scale models, model compression techniques such as low-rank decomposition have been explored. Previous model decomposition works have focused on weight decomposition to avoid costly runtime decomposition, whose latency often significantly exceeds the benefits from decomposition (e.g., 38% more end-to-end latency when running Llama2-7b on A100 with 4K sequence length with activation decomposition compared to no decomposition). In this work, we debunk such observations and report that input decomposition can be significantly beneficial with a proper choice of decomposition algorithm and hardware support. We adopt a progressive decomposition algorithm, the Lanczos algorithm, and design a co-accelerator architecture for the decomposition algorithm. To address the memory-boundness of the decomposition operation, we introduce a novel compute replication methodology that moves the operation toward the compute-bound region, which enables a 6.2x speedup in our evaluation. We also develop an output shape-preserving computation scheme that eliminates decomposition costs in consecutive layers. To compensate for model quality loss from compression, we introduce a multi-track decomposition approach that separately handles outlier channels for high accuracy and low perplexity with minimal computational costs. Combined together, our accelerator, D-com, provides 22% end-to-end latency improvements compared to A100 GPU at the cost of small model quality degradation (e.g., 3% on AI2 Reasoning Challenge task).
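Abstract (reference translation): The computation and memory costs of large language models have continued to grow over the past decade and have reached the scale of 1T parameters. To address the challenges posed by large-scale models, model compression techniques such as low-rank decomposition have been studied. Prior work has concentrated on weight decomposition in order to avoid costly runtime decomposition, whose latency often outweighs its benefits (for example, running Llama2-7b on an A100 with a 4K sequence length incurs 38% more end-to-end latency with activation decomposition than without decomposition). This work debunks such observations and shows that input decomposition can be beneficial when the decomposition algorithm and hardware support are chosen appropriately. The authors adopt a progressive decomposition algorithm, the Lanczos algorithm, and design a co-accelerator architecture for it. To address the memory-bound nature of the decomposition operation, they introduce a new compute replication methodology that moves the operation toward the compute-bound region, achieving a 6.2x speedup in their evaluation. They also develop an output shape-preserving computation scheme that eliminates decomposition costs in consecutive layers. To compensate for the model quality loss caused by compression, they introduce a multi-track decomposition approach that handles outlier channels separately, giving high accuracy and low perplexity at minimal computational cost. Their accelerator, D-com, improves end-to-end latency by 22% compared to an A100 GPU at the cost of small model quality degradation (for example, 3% on the AI2 Reasoning Challenge task).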
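
The abstract names the Lanczos algorithm as the progressive decomposition method but gives no details here. As a rough illustration only, the following NumPy sketch uses Golub-Kahan-Lanczos bidiagonalization, one common Lanczos variant for rectangular matrices, to build a rank-k factorization X ≈ U B Vᵀ one step at a time. The function name `lanczos_low_rank`, the full reorthogonalization, the choice of start vector, and the synthetic rank-8 "activation" matrix are all assumptions made for this example; this is not the authors' implementation and it does not model the D-com hardware.

```python
import numpy as np


def lanczos_low_rank(A, rank, seed=0, tol=1e-12):
    """Hypothetical helper: factor a nonzero A (m x n) as U @ B @ V.T.

    U (m x k) and V (n x k) have orthonormal columns and B (k x k) is
    upper bidiagonal, with k <= rank (smaller only on early breakdown).
    """
    m, n = A.shape
    rng = np.random.default_rng(seed)
    U = np.zeros((m, rank))
    V = np.zeros((n, rank))
    B = np.zeros((rank, rank))

    # Assumption: start inside the row space of A so that every v_j
    # carries information about A rather than about its null space.
    v = A.T @ rng.standard_normal(m)
    v /= np.linalg.norm(v)
    u_prev = np.zeros(m)
    beta = 0.0
    k = 0
    for j in range(rank):
        V[:, j] = v
        # Left vector: A v_j = beta_{j-1} u_{j-1} + alpha_j u_j
        u = A @ v - beta * u_prev
        u -= U[:, :j] @ (U[:, :j].T @ u)  # full reorthogonalization
        alpha = np.linalg.norm(u)
        if alpha < tol:
            break  # breakdown: nothing new to add
        u /= alpha
        U[:, j] = u
        B[j, j] = alpha
        k = j + 1
        if k == rank:
            break
        # Right vector: A^T u_j = alpha_j v_j + beta_j v_{j+1}
        w = A.T @ u - alpha * v
        w -= V[:, :k] @ (V[:, :k].T @ w)  # full reorthogonalization
        beta = np.linalg.norm(w)
        if beta < tol:
            break  # Krylov space exhausted: A is captured exactly
        B[j, j + 1] = beta
        v = w / beta
        u_prev = u
    return U[:, :k], B[:k, :k], V[:, :k]


# Synthetic stand-in for an activation matrix with exact rank 8.
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 32))
U, B, V = lanczos_low_rank(X, rank=8)
rel_err = np.linalg.norm(X - U @ B @ V.T) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.2e}")
```

Each iteration adds one column to U and V, so the loop can stop as soon as the approximation is good enough, which is what makes this family of methods progressive. Each step is also dominated by two matrix-vector products, a pattern that tends to be memory-bound on GPUs and is consistent with the bottleneck the abstract describes.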

Related papers