Fugu-MT Paper Translation (Abstract): Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree

Paper abstract: Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree

arxiv url: http://arxiv.org/abs/2412.12639v2
Date: Thu, 26 Dec 2024 06:20:21 GMT
Status: Translation complete
Last updated in system: 2024-12-30 17:55:26.615319
Title: Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree
Title (reference translation): Falcon: Fast and Parallel Inference of Large Language Models via Semi-Autoregressive Drafting and a Custom Decoding Tree
Authors: Xiangxiang Gao, Weisheng Xie, Yiwei Xiang, Feng Ji
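Abstract summary: Falcon is an innovative speculative decoding framework designed to improve both the drafter's parallelism and its output quality. Falcon incorporates the Coupled Sequential Glancing Distillation technique.
Reference score (site-computed attention score): 7.438117410146904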
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Striking an optimal balance between minimal drafting latency and high speculation accuracy to enhance the inference speed of Large Language Models remains a significant challenge in speculative decoding. In this paper, we introduce Falcon, an innovative semi-autoregressive speculative decoding framework fashioned to augment both the drafter's parallelism and output quality. Falcon incorporates the Coupled Sequential Glancing Distillation technique, which fortifies inter-token dependencies within the same block, leading to increased speculation accuracy. We offer a comprehensive theoretical analysis to illuminate the underlying mechanisms. Additionally, we introduce a Custom-Designed Decoding Tree, which permits the drafter to generate multiple tokens in a single forward pass and accommodates multiple forward passes as needed, thereby boosting the number of drafted tokens and significantly improving the overall acceptance rate. Comprehensive evaluations on benchmark datasets such as MT-Bench, HumanEval, and GSM8K demonstrate Falcon's superior acceleration capabilities. The framework achieves a lossless speedup ratio ranging from 2.91x to 3.51x when tested on the Vicuna and LLaMA2-Chat model series. These results outstrip existing speculative decoding methods for LLMs, including Eagle, Medusa, Lookahead, SPS, and PLD, while maintaining a compact drafter architecture equivalent to merely two Transformer layers.
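Abstract (reference translation): Balancing minimal drafting latency against high speculation accuracy to improve the inference speed of large language models is a key challenge in speculative decoding. This paper introduces Falcon, an innovative semi-autoregressive speculative decoding framework that improves both the drafter's parallelism and its output quality. Falcon incorporates the Coupled Sequential Glancing Distillation technique. We provide a comprehensive theoretical analysis to illuminate the underlying mechanisms. We also introduce a Custom-Designed Decoding Tree, which lets the drafter generate multiple tokens in a single forward pass and accommodates multiple forward passes as needed, increasing the number of drafted tokens and significantly improving the overall acceptance rate. Comprehensive evaluations on benchmark datasets such as MT-Bench, HumanEval, and GSM8K demonstrate Falcon's superior acceleration capabilities. The framework achieves a lossless speedup ratio of 2.91x to 3.51x when tested on the Vicuna and LLaMA2-Chat model series. These results surpass existing speculative decoding methods for LLMs, including Eagle, Medusa, Lookahead, SPS, and PLD, while maintaining a compact drafter architecture equivalent to only two Transformer layers.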
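
The abstract describes a drafter that proposes several tokens per forward pass and a target model that verifies them losslessly. The following is a minimal sketch of that general draft-then-verify loop under greedy decoding. It is not Falcon's implementation: the functions `toy_target_next` and `toy_draft_block` are hypothetical stand-ins for the target model and a block (semi-autoregressive) drafter, and the decoding tree and distillation technique are not modeled.

```python
from typing import Callable, List

Token = int


def verify_greedy(target_next: Callable[[List[Token]], Token],
                  prefix: List[Token],
                  draft: List[Token]) -> List[Token]:
    """Return the tokens committed by one draft-then-verify step.

    Draft tokens are accepted while they match the target's greedy choice.
    At the first mismatch the target's token is committed instead. If every
    draft token matches, one extra target token is appended.
    A real system scores all draft positions in one parallel target pass;
    here the target is called position by position for clarity.
    """
    committed: List[Token] = []
    context = list(prefix)
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            committed.append(expected)  # correction from the target
            return committed
        committed.append(tok)
        context.append(tok)
    committed.append(target_next(context))  # bonus token
    return committed


def speculative_generate(target_next, draft_block, prompt, max_new_tokens, block_size):
    """Generate tokens by repeatedly drafting a block and verifying it."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)  # one drafter pass, block_size tokens
        out.extend(verify_greedy(target_next, out, draft))
    return out[:len(prompt) + max_new_tokens]


def greedy_generate(target_next, prompt, max_new_tokens):
    """Plain autoregressive greedy decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


# Hypothetical toy models (assumptions for illustration only).
def toy_target_next(context: List[Token]) -> Token:
    return (context[-1] + 1) % 10


def toy_draft_block(context: List[Token], k: int) -> List[Token]:
    block = [(context[-1] + i + 1) % 10 for i in range(k)]
    if k >= 3:
        block[2] = (block[2] + 5) % 10  # deliberately wrong third token
    return block


if __name__ == "__main__":
    spec = speculative_generate(toy_target_next, toy_draft_block, [0], 8, 4)
    ref = greedy_generate(toy_target_next, [0], 8)
    print(spec == ref)
```

Because a draft token is committed only when it equals the target's greedy choice, the output matches plain greedy decoding of the target, which is the sense in which this kind of speedup is called lossless. In the toy setup each step commits three tokens for one drafter pass.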

Related papers

Accelerating Diffusion LLMs via Adaptive Parallel Decoding [50.9948753314669]
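We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. APD delivers markedly higher throughput with minimal quality degradation on downstream benchmarks.
Reference translation (metadata) (2025-05-31T06:10:10Z)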
SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences [4.268504966623081]
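We introduce SpecExtend, a drop-in enhancement that improves speculative decoding performance on long sequences. SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models. We also propose Cross-model Retrieval, a novel KV cache update strategy.
Reference translation (metadata) (2025-05-27T06:30:00Z)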
Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE [15.003006630308517]
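Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens. We propose Jakiro, which leverages Mixture of Experts (MoE). The proposed method significantly improves prediction accuracy and achieves faster inference.
Reference translation (metadata) (2025-02-10T09:24:06Z)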
ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
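We propose ParallelSpec, an alternative to the autoregressive drafting strategy used in state-of-the-art speculative decoding methods. In contrast to autoregressive drafting in the speculative stage, we train a parallel drafter that serves as an efficient speculative model.
Reference translation (metadata) (2024-10-08T01:05:08Z)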
Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [59.17158389902231]
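Speculative decoding is widely adopted as a method for accelerating large language model inference. We propose a speculative decoding method that generates draft sequences with a discrete diffusion model.
Reference translation (metadata) (2024-08-10T21:24:25Z)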
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
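We propose a novel parallel decoding method, Hidden Transfer, which decodes multiple consecutive tokens simultaneously in a single forward pass. In acceleration measurements, it outperforms single-model acceleration techniques such as Medusa and Self-Speculative decoding.
Reference translation (metadata) (2024-04-18T09:17:06Z)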
SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens [4.5888031410244885]
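We propose a method for accelerating large language models (LLMs) through speculative decoding with semantic adaptive tokens (SDSAT). The main goal of this design is to improve the ability to generate draft tokens more accurately without compromising the accuracy of the LLM. Experiments on CodeLlama-13B and 7B achieve speedups of more than 3.5X and 3.0X, respectively.
Reference translation (metadata) (2024-03-27T14:54:27Z)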
Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
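We propose a novel framework designed specifically for speculative sampling. Within this framework, we introduce a lightweight draft model that effectively uses previously generated tokens to predict subsequent words. We report an average latency speedup ratio of 2.7x compared with vanilla autoregressive decoding.
Reference translation (metadata) (2024-02-24T08:10:39Z)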
Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding [11.832919020149891]
This work aims to accelerate the inference speed of large language models (LLMs) with billions of parameters. We propose Smart Parallel Auto-Correct dEcoding (SPACE).
Reference translation (metadata) (2024-02-19T03:39:10Z)
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding [43.659680579686544]
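We propose the Fast and Robust Early-Exiting framework, which combines a shallow-deep module with parallel decoding. The framework enables faster inference by synchronizing the decoding of the current token with previously stacked early-exited tokens. Because parallel decoding lets us observe predictions from both the shallow and deep models, we also propose a novel adaptive threshold estimator.
Reference translation (metadata) (2023-10-09T05:53:05Z)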
Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
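We propose Paraformer, a fast and accurate parallel Transformer. It accurately predicts the number of output tokens and extracts hidden variables. It achieves performance comparable to state-of-the-art AR Transformers with more than a 10x speedup.
Reference translation (metadata) (2022-06-16T17:24:14Z)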
Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation [80.2267931231335]
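We propose Speculative Decoding (SpecDec), which exploits the idea of speculative execution to accelerate autoregressive (AR) decoding. SpecDec has two innovations: Spec-Drafter, an independent model optimized specifically for efficient drafting, and Spec-Verification, a reliable method for efficiently verifying the drafted tokens.
Reference translation (metadata) (2022-03-30T17:27:09Z)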
Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring [83.32560748324667]
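This paper describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models. We propose Orthros, a unified NAR E2E-ST framework with a NAR decoder and an auxiliary shallow AR decoder on top of a shared encoder.
Reference translation (metadata) (2021-09-09T16:50:16Z)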
FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [74.04394069262108]
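We propose FastLR, a non-autoregressive (NAR) lipreading model that generates all target tokens simultaneously. FastLR achieves a 10.97x speedup compared with the state-of-the-art lipreading model.
Reference translation (metadata) (2020-08-06T08:28:56Z)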

The related papers list is generated automatically from the titles and abstracts of papers on this site.

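This is information about the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from use of this site (including all information and translations).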