Fugu-MT Paper Translation (Abstract): Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models

Paper overview: Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models

arxiv url: http://arxiv.org/abs/2511.18890v1
Date: Mon, 24 Nov 2025 08:46:36 GMT
Status: translation complete
Last updated in system: 2025-11-25 18:34:25.117846
Title: Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models
Title (reference translation): Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models
Authors: Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Hannah Zhang, Nikolaus Binder, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov
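Abstract summary: This work aims to identify the key determinants of SLMs' real-device latency and to offer generalizable principles and methodologies for SLM design and training. We introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs.
Reference score (independently computed attention score): 97.55009021098554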
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth-width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth-width ratios, with the key finding that although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier. Next, we explore emerging efficient attention alternatives to evaluate their potential as candidate building operators. Using the identified promising operators, we construct an evolutionary search framework to automatically discover latency-optimal combinations of these operators within hybrid SLMs, thereby advancing the accuracy-latency frontier. In addition to architectural improvements, we further enhance SLM training using a weight normalization technique that enables more effective weight updates and improves final convergence. Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5% average accuracy, 1.3x/1.9x lower latency, and 18.7x/45.6x higher throughput compared to Qwen3-1.7B/0.6B, respectively.
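Abstract (reference translation): Efficient deployment of small language models (SLMs) is essential for many real-world applications with strict latency constraints. Previous work on SLM design has focused mainly on reducing parameter counts to obtain parameter-optimal SLMs, but parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and to provide generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth-width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth-width ratios and find that, although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier. Next, we explore emerging efficient attention alternatives to evaluate their potential as candidate building operators. Using the promising operators identified, we build an evolutionary search framework that automatically discovers latency-optimal combinations of these operators within hybrid SLMs, advancing the accuracy-latency frontier. Beyond architectural improvements, we further enhance SLM training with a weight normalization technique that enables more effective weight updates and improves final convergence. Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which achieves over +5.5% average accuracy, 1.3x/1.9x lower latency, and 18.7x/45.6x higher throughput compared to Qwen3-1.7B/0.6B, respectively.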
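
The abstract mentions an evolutionary search over operator combinations but gives no details of the framework. The following is only a generic toy sketch of that idea, not the authors' method: the operator names, the per-layer latency and quality numbers, the layer count, and the latency budget are all hypothetical placeholders chosen for illustration.

```python
import random

# Hypothetical operator menu. The latency and quality values are made-up
# placeholders for illustration, not measurements from the paper.
OPERATORS = {
    "attention": {"latency": 3.0, "quality": 3.0},
    "linear_attention": {"latency": 1.5, "quality": 2.0},
    "ssm": {"latency": 1.0, "quality": 1.2},
}
NUM_LAYERS = 8          # assumed depth of the toy model
LATENCY_BUDGET = 14.0   # assumed latency budget in arbitrary units


def latency(arch):
    """Total toy latency of an architecture (one operator name per layer)."""
    return sum(OPERATORS[op]["latency"] for op in arch)


def fitness(arch):
    """Toy proxy score; architectures over the latency budget are rejected."""
    if latency(arch) > LATENCY_BUDGET:
        return float("-inf")
    return sum(OPERATORS[op]["quality"] for op in arch)


def mutate(arch, rng):
    """Replace the operator at one randomly chosen layer."""
    child = list(arch)
    child[rng.randrange(len(child))] = rng.choice(list(OPERATORS))
    return tuple(child)


def evolutionary_search(generations=50, population_size=16, seed=0):
    rng = random.Random(seed)
    # Start from the cheapest architecture so the first population is feasible.
    population = [("ssm",) * NUM_LAYERS]
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng)
                    for _ in range(population_size)]
        # Deduplicate while preserving order, then keep the fittest candidates
        # among parents and children (elitist selection).
        candidates = list(dict.fromkeys(population + children))
        candidates.sort(key=fitness, reverse=True)
        population = candidates[:population_size]
    return population[0]


best = evolutionary_search()
print(best, latency(best), fitness(best))
```

In a real setting the fitness function would come from measured on-device latency and a trained accuracy proxy rather than a lookup table; the sketch only shows the mutate, evaluate, and select loop under a latency constraint.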

Related papers