Fugu-MT Paper Translation (Abstract): DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training

Paper overview: DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training

arxiv url: http://arxiv.org/abs/2601.21824v1
Date: Thu, 29 Jan 2026 15:10:13 GMT
Status: Translation complete
Last updated in system: 2026-01-30 16:22:49.920557
Title: DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training
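Title (reference translation): DASH: Deterministic Attention Scheduling for High-Throughput Reproducible LLM Training
Authors: Xinwei Qiang, Hongmin Chen, Shixuan Sun, Jingwen Leng, Xin Liu, Minyi Guo
Abstract summary: In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction. We formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG). We propose DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies.
Reference score (independently computed attention metric): 22.898073682504023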
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non-deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient-reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q-Tile Iteration, a reversed query-block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks. Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28$\times$ compared to the baseline, significantly advancing the efficiency of reproducible LLM training. Our code is open-sourced at https://github.com/SJTU-Liquid/deterministic-FA3.
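Abstract (reference translation): Determinism is indispensable for reproducibility in large language model (LLM) training, but it often comes at a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can reduce throughput by up to 37.9% compared with the non-deterministic one. This performance loss stems from suboptimal scheduling of the compute and gradient-reduction phases, which leaves the hardware significantly underutilized. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we propose DASH (Deterministic Attention Scheduling for High-Throughput), which comprises two strategies: (i) Descending Q-Tile Iteration, a reversed query-block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within the DAG model that reduces pipeline stalls for both full and causal masks. Empirical evaluation on NVIDIA H800 GPUs shows that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28$\times$ over the baseline, significantly improving the efficiency of reproducible LLM training. Our code is open-sourced at https://github.com/SJTU-Liquid/deterministic-FA3.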
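
The abstract attributes the cost of determinism to serialized gradient accumulation. A minimal sketch of why the order matters: floating-point addition is not associative, so summing the same partial gradients in a different order can change the result. The values and the `accumulate` helper below are hypothetical illustrations, not code from the DASH repository.

```python
def accumulate(values, order):
    """Sum `values` in the given index order using ordinary float addition."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

# Hypothetical partial gradients produced by three tiles.
partials = [0.1, 0.2, 0.3]

# A fixed (deterministic) reduction order versus a different arrival order.
fixed_order = [0, 1, 2]
other_order = [2, 1, 0]

fixed_sum = accumulate(partials, fixed_order)
other_sum = accumulate(partials, other_order)

# The same order always reproduces the same bits...
print(fixed_sum == accumulate(partials, fixed_order))
# ...while a different order may round differently.
print(fixed_sum == other_sum)
```

Enforcing one fixed order makes the result reproducible, but it also forces the accumulation steps to wait on one another, which is the serialization cost the abstract describes.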

Related papers