Fugu-MT Paper Translation (Abstract): DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

Paper overview: DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

arxiv url: http://arxiv.org/abs/2605.20936v1
Date: Wed, 20 May 2026 09:21:22 GMT
Status: Translation complete
System update date: 2026-05-21 19:19:56.595906
Title: DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU
Title (reference translation): DASH: Fast differentiable architecture search for hybrid attention, in minutes on a single GPU
Authors: Weizhe Chen, Miao Zhang, Junpeng Jiang, Yaping Li, Weili Guan, Liqiang Nie
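Abstract summary: DASH is a fast differentiable search framework for hybrid attention architecture design. It relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs an architecture-only search with model and operator weights frozen.
Reference score (independently computed interest score): 62.52524380866359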
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen to significantly enhance search efficiency. On Qwen2.5-3B-Instruct, DASH consistently outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, showing that direct differentiable search can discover stronger hybrid architectures. Moreover, DASH achieves stronger RULER performance than released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Notably, each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, corresponding to merely 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search, providing a promising direction for hybrid architecture design.
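Abstract (reference translation): Hybrid attention architectures are becoming an important paradigm for improving LLM inference efficiency while maintaining model quality, which makes hybrid architecture design a central problem. Existing designs often depend on manual rules of thumb or proxy-based selector signals to allocate operators layer by layer. Jet-Nemotron, a recent NAS-style system, demonstrates the potential of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone consume 200B tokens, which makes such search pipelines hard to use as a routine method for hybrid architecture design. DASH is a fast differentiable search framework for hybrid attention architecture design: it relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and runs an architecture-only search with model and operator weights frozen to substantially improve search efficiency. On Qwen2.5-3B-Instruct, DASH outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, indicating that direct differentiable search can discover stronger hybrid architectures. DASH also achieves stronger RULER performance than the released Jet-Nemotron models while staying competitive on overlapping short-context and general benchmarks. Notably, each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, which corresponds to 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search, a promising direction for hybrid architecture design.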
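
The abstract describes relaxing the discrete per-layer choice of attention operator into continuous architecture logits and then optimizing only those logits while all model and operator weights stay frozen. The following is a minimal toy sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the scalar "operators", the per-layer error values in `LINEAR_ERROR`, the `COST_PENALTY` term, and the use of finite-difference gradients in place of the automatic differentiation a real system would use.

```python
import math

# Hypothetical toy values: how far each layer's frozen linear candidate is
# from the teacher's full-attention layer (larger = worse approximation).
LINEAR_ERROR = [0.01, 0.9, 0.02, 0.8]
COST_PENALTY = 0.1          # hypothetical price per full-attention layer kept
OPS = ("full", "linear")    # candidate operators per layer


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def layer_output(x, layer, op):
    """Frozen toy operators: the teacher layer adds 1.0, the linear one is off."""
    return x + 1.0 - (LINEAR_ERROR[layer] if op == "linear" else 0.0)


def mixed_forward(x, arch_logits):
    """Continuous relaxation: each layer is a softmax-weighted mix of candidates."""
    for layer, logits in enumerate(arch_logits):
        weights = softmax(logits)
        x = sum(w * layer_output(x, layer, op) for w, op in zip(weights, OPS))
    return x


def search_loss(arch_logits, x=0.0):
    """Mismatch to the all-full-attention teacher plus an expected-cost term."""
    target = x
    for layer in range(len(arch_logits)):
        target = layer_output(target, layer, "full")
    mismatch = (mixed_forward(x, arch_logits) - target) ** 2
    expected_full = sum(softmax(row)[0] for row in arch_logits)
    return mismatch + COST_PENALTY * expected_full


def numerical_grad(arch_logits, eps=1e-4):
    """Central finite differences w.r.t. the logits only (stand-in for autodiff)."""
    grads = []
    for row in arch_logits:
        grad_row = []
        for j in range(len(row)):
            original = row[j]
            row[j] = original + eps
            up = search_loss(arch_logits)
            row[j] = original - eps
            down = search_loss(arch_logits)
            row[j] = original
            grad_row.append((up - down) / (2 * eps))
        grads.append(grad_row)
    return grads


def run_search(num_layers=4, steps=200, lr=0.5):
    """Gradient descent on architecture logits; operator 'weights' never change."""
    arch_logits = [[0.0, 0.0] for _ in range(num_layers)]
    for _ in range(steps):
        grads = numerical_grad(arch_logits)
        for row, grad_row in zip(arch_logits, grads):
            for j, g in enumerate(grad_row):
                row[j] -= lr * g
    return arch_logits


def discretize(arch_logits):
    """Pick the highest-logit operator in each layer."""
    return [OPS[max(range(len(row)), key=row.__getitem__)] for row in arch_logits]


if __name__ == "__main__":
    print(discretize(run_search()))
```

In this toy setup, `discretize(run_search())` keeps full attention only in the two layers whose linear candidate is far from the teacher and assigns the linear operator elsewhere. The cost term is what pushes layers toward the cheaper operator; without it, the mismatch term alone would favor full attention everywhere.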


Related papers