Fugu-MT Paper Translation (Abstract): SpecTr: Fast Speculative Decoding via Optimal Transport

Paper overview: SpecTr: Fast Speculative Decoding via Optimal Transport

arxiv url: http://arxiv.org/abs/2310.15141v2
Date: Thu, 18 Jan 2024 04:42:34 GMT
Status: translation complete
Last updated in system: 2024-01-19 20:00:28.215637
Title: SpecTr: Fast Speculative Decoding via Optimal Transport
Authors: Ziteng Sun and Ananda Theertha Suresh and Jae Hun Ro and Ahmad Beirami and Himanshu Jain and Felix Yu
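Abstract summary: The algorithm speeds up decoding while guaranteeing that there is no quality degradation in the decoded output. Experiments show that, for state-of-the-art large language models, the proposed approach achieves a wall-clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
Reference score (attention score computed by this site): 30.18181671899423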
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive sampling from large language models has led to state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time, making it slow and even prohibitive in certain tasks. One way to speed up sampling is $\textit{speculative decoding}$: use a small model to sample a $\textit{draft}$ (a block or sequence of tokens), and then score all tokens in the draft with the large language model in parallel. A subset of the tokens in the draft is accepted (and the rest rejected) based on a statistical method that guarantees that the final output follows the distribution of the large model. In this work, we provide a principled understanding of speculative decoding through the lens of optimal transport (OT) with $\textit{membership cost}$. This framework can be viewed as an extension of the well-known $\textit{maximal-coupling}$ problem. This new formulation enables us to generalize the speculative decoding method to allow for a set of $k$ candidates at the token level, which leads to an improved optimal membership cost. We show that the optimal draft selection algorithm (transport plan) can be computed via linear programming, whose best-known runtime is exponential in $k$. We then propose a valid draft selection algorithm whose acceptance probability is $(1-1/e)$-optimal multiplicatively. Moreover, it can be computed in time almost linear in the size of the domain of a single token. Using this new draft selection algorithm, we develop a new autoregressive sampling algorithm called $\textit{SpecTr}$, which provides a speedup in decoding while ensuring that there is no quality degradation in the decoded output. We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall-clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
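The accept/reject step that the abstract describes can be illustrated for a single token and a single draft candidate (the $k = 1$ case). The following is a minimal sketch of standard token-level speculative sampling, not the SpecTr draft selection algorithm itself. The names `speculative_step` and `acceptance_probability`, and the toy distributions `p` (small model) and `q` (large model), are assumptions made for illustration.

```python
import random


def speculative_step(p, q, rng):
    """One token-level accept/reject step (k = 1).

    p: draft (small-model) probabilities, q: target (large-model) probabilities.
    Returns (token, accepted). The returned token is distributed according to q.
    """
    # Sample a draft token from the small model.
    x = rng.choices(range(len(p)), weights=p)[0]
    # Accept the draft token with probability min(1, q(x) / p(x)).
    if rng.random() < min(1.0, q[x] / p[x]):
        return x, True
    # On rejection, resample from the residual distribution, which is
    # proportional to max(q - p, 0). rng.choices accepts unnormalized weights.
    residual = [max(qi - pi, 0.0) for pi, qi in zip(p, q)]
    y = rng.choices(range(len(q)), weights=residual)[0]
    return y, False


def acceptance_probability(p, q):
    """Probability that the draft token is accepted: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Hypothetical toy distributions over a three-token vocabulary.
    p = [0.5, 0.3, 0.2]
    q = [0.2, 0.2, 0.6]
    rng = random.Random(0)
    token, accepted = speculative_step(p, q, rng)
    print(token, accepted, acceptance_probability(p, q))
```

The acceptance probability equals one minus the total variation distance between `p` and `q`, which is the quantity that the maximal-coupling view mentioned in the abstract optimizes.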
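The optimal transport problem with membership cost can be written out as a sketch. The notation is assumed here and does not appear in the abstract: $p$ is the small-model (draft) distribution, $q$ is the large-model (target) distribution over a token domain $\Omega$, and $\Pi(\cdot, \cdot)$ denotes the set of couplings with the given marginals. The sketch also assumes that the $k$ candidates are drawn i.i.d. from $p$.

```latex
% k = 1: the maximal-coupling problem. The optimal cost is the total
% variation distance, so the best acceptance probability is sum_x min(p, q).
\min_{\pi \in \Pi(p,\, q)} \Pr_{(X, Y) \sim \pi}\left[ Y \neq X \right]
  = d_{\mathrm{TV}}(p, q)
  = 1 - \sum_{x \in \Omega} \min\{ p(x),\, q(x) \}

% k candidates: the cost is incurred only when the output token Y is not
% a member of the candidate set (the membership cost).
\min_{\pi \in \Pi(p^{\otimes k},\, q)}
  \Pr_{(X_1, \dots, X_k, Y) \sim \pi}\left[ Y \notin \{ X_1, \dots, X_k \} \right]
```

Under this reading, a larger $k$ can only lower the optimal cost, which matches the abstract's claim of an improved optimal membership cost.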

Related papers

DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding [7.204881999658682]
DEL is a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL achieves overall speedups of $2.16\times \sim 2.50\times$ over vanilla auto-regressive decoding.
Paper reference translation (metadata) (2025-04-08T01:12:59Z)
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
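Evaluating constraints on tokens is prohibitively costly. LCD can distort the global distribution over strings, because it samples tokens based only on local information. We show that our approach outperforms state-of-the-art baselines.
Paper reference translation (metadata) (2025-04-07T18:30:18Z)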
SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference [9.143856130336783]
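SuffixDecoding is a model-free approach to accelerating large language model (LLM) inference through speculative decoding. The approach enables flexible tree-structured speculation without the overhead of maintaining and orchestrating additional models. On a proprietary multi-LLM text-to-token application, SuffixDecoding achieves 2.9x output throughput and 3x lower latency.
Paper reference translation (metadata) (2024-11-07T18:49:33Z)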
Graph-Structured Speculative Decoding [52.94367724136063]
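Speculative decoding has emerged as a promising technique for accelerating inference in large language models. This paper proposes an innovative approach that uses a directed acyclic graph (DAG) to manage the drafted hypotheses. We observe a notable speedup of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding.
Paper reference translation (metadata) (2024-07-23T06:21:24Z)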
Perturb-and-Project: Differentially Private Similarities and Marginals [73.98880839337873]
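We revisit the input perturbation framework for differential privacy, in which noise is added to the input. We first design new efficient algorithms for privately releasing pairwise cosine similarities. We then derive new algorithms for computing $k$-way marginal queries over $n$ features.
Paper reference translation (metadata) (2024-06-07T12:07:16Z)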
Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass [72.07642648108849]
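Superposed Decoding is a new decoding algorithm that generates $k$ drafts at the cost of one autoregressive inference pass. Combined with other decoding strategies, Superposed Decoding yields universal coverage gains when scaling inference-time compute.
Paper reference translation (metadata) (2024-05-28T17:40:48Z)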
Some Notes on the Sample Complexity of Approximate Channel Simulation [2.4554686192257424]
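Channel simulation algorithms can efficiently encode random samples from a prescribed target distribution $Q$ and find applications in machine learning-based lossy data compression. This paper considers approximate schemes with a fixed runtime. Global-bound, depth-limited A* coding is exploited to ensure $\mathrm{TV}[Q \Vert P] \leq \epsilon$ and maintain optimal coding performance with a sample complexity involving only $(D_{\mathrm{KL}}[Q \Vert P] + o(1)) / \epsilon$.
Paper reference translation (metadata) (2024-05-07T14:44:41Z)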
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
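Speculative decoding is a widely used method for accelerating the generation process of large language models. We introduce Ouroboros, which generates draft phrases to parallelize the drafting process. Ouroboros achieves speedups of up to 2.8x over speculative decoding and 3.9x over vanilla decoding.
Paper reference translation (metadata) (2024-02-21T11:31:28Z)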
SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference [17.947904697850433]
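We propose SkipDecode, a token-level early-exit method designed to work with batch inference and KV caching. It overcomes prior constraints by setting up a singular-level exit point for every token in a batch at each sequence position. It also guarantees a monotonic decrease in exit points, which eliminates the need to recompute KV caches for preceding tokens.
Paper reference translation (metadata) (2023-07-05T19:59:09Z)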
Truncation Sampling as Language Model Desmoothing [115.28983143361681]
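Long samples of text from neural language models can be of poor quality. Truncation sampling algorithms set the probability of some words to zero at each step. This paper introduces $\eta$-sampling, which truncates words below an entropy-dependent probability threshold.
Paper reference translation (metadata) (2022-10-27T05:52:35Z)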
Best Policy Identification in Linear MDPs [70.57916977441262]
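We investigate the best policy identification problem in discounted linear Markov decision processes in the fixed-confidence setting under a generative model. The lower bound, expressed as the solution of an intricate non-convex optimization program, can be used as a starting point for devising such algorithms.
Paper reference translation (metadata) (2022-08-11T04:12:50Z)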

The related papers list is generated automatically from the titles and abstracts of papers on this site.
