Fugu-MT Paper Translation (Abstract): CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models

Paper overview: CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models

arxiv url: http://arxiv.org/abs/2509.25996v1
Date: Tue, 30 Sep 2025 09:28:47 GMT
Status: Translation complete
System update date: 2025-10-01 17:09:04.489396
Title: CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models
Authors: Weiyu Huang, Yuezhou Hu, Jun Zhu, Jianfei Chen
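Abstract summary: Sparsity-aware training is an effective approach for transforming large language models into hardware-friendly sparse patterns. The authors propose Continuous Adaptive Sparse Trainer (CAST), a continuous and differentiable sparsity-aware training framework for sparse models. The results show significant improvements over previous state-of-the-art methods in both perplexity and zero-shot accuracy with minimal training resources.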
Reference score (site-computed attention score): 27.682531424487564
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparsity-aware training is an effective approach for transforming large language models (LLMs) into hardware-friendly sparse patterns, thereby reducing latency and memory consumption during inference. In this paper, we propose Continuous Adaptive Sparse Trainer (CAST), a fully continuous and differentiable sparsity-aware training framework for semi-structured (or "N:M") sparse models. Unlike previous approaches that optimize sparsity patterns and weights separately, CAST enables seamless joint optimization during training, while progressively transforming the model into the desired sparsity format. Specifically, CAST introduces three key components: 1) AdamS, a sparsity-aware optimizer that leverages adaptive L1 decay to promote uniform sparsification across all parameters; 2) Weight Scaling, a module designed to mitigate the magnitude reduction caused by decay while preserving desired sparsity patterns; 3) Knowledge Distillation, which employs the dense model as a self-teacher to enhance training efficiency. We evaluate CAST under 2:4 sparsity patterns across multiple model families, ranging from 125M to 13B parameters. Our results demonstrate significant improvements over previous state-of-the-art methods in both perplexity and zero-shot accuracy with minimal training resources. Notably, on LLaMA2-7B, our 2:4 sparse model achieves a negligible perplexity increase of 0.09 and a 0.36% gain in zero-shot accuracy compared to the dense model using only 2% of the original pretraining tokens. Additionally, we establish an accurate and robust empirical scaling law to predict sparse model performance given adequate training resources. Finally, we demonstrate the practical applicability of our sparse models by evaluating them under quantization and fine-tuning scenarios.
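To make the semi-structured ("N:M") format mentioned in the abstract concrete, here is a minimal sketch of what a 2:4 pattern means: in every group of 4 consecutive weights, at most 2 are nonzero. The function name `nm_prune` and the sample row are hypothetical, and the selection rule used here (keep the largest magnitudes in one shot) is only a common baseline for illustration. It is not CAST's training procedure, which the abstract describes as reaching the pattern progressively through decay, weight scaling, and distillation.

```python
def nm_prune(weights, n=2, m=4):
    """Zero all but the n largest-magnitude entries in each group of m weights.

    Hypothetical helper for illustration: one-shot magnitude pruning,
    not the CAST method.
    """
    if len(weights) % m != 0:
        raise ValueError("length of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions of the n largest-magnitude entries within this group.
        by_magnitude = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(by_magnitude[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Made-up example row with two groups of four weights.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = nm_prune(row)
print(sparse_row)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Each group of four keeps exactly two nonzero values, which is the layout that 2:4 sparse hardware kernels expect.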


Related papers