Fugu-MT Paper Translation (Summary): Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Paper summary: Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

arxiv url: http://arxiv.org/abs/2605.12825v1
Date: Tue, 12 May 2026 23:47:35 GMT
Status: Translation complete
System update date: 2026-05-14 23:30:27.727755
Title: Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
Title (reference translation): Orthrus: Memory-efficient parallel token generation via dual-view diffusion
Authors: Chien Van Nguyen, Chaitra Hegde, Van Cuong Pham, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen
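Abstract summary: Orthrus is a framework that unifies the exact generation fidelity of autoregressive large language models (LLMs) with the high-speed parallel token generation of diffusion models. It achieves up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.
Reference score (independently computed attention metric): 91.43717463458812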
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding represents a fundamental bottleneck for high-throughput inference. While diffusion language models attempt to break this barrier via parallel generation, they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to seamlessly integrate into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to the exact same high-fidelity Key-Value (KV) cache; the autoregressive head executes context pre-filling to construct accurate KV representations, while the diffusion head executes parallel generation. By employing an exact consensus mechanism between the two views, Orthrus guarantees lossless inference, delivering up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.
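Abstract (reference translation): This paper introduces Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive large language models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding is a fundamental bottleneck for high-throughput inference. Diffusion language models try to break this barrier through parallel generation, but they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to integrate seamlessly into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to exactly the same high-fidelity Key-Value (KV) cache: the autoregressive head performs context pre-filling to build accurate KV representations, while the diffusion head performs parallel generation. By using an exact consensus mechanism between the two views, Orthrus guarantees lossless inference and delivers up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.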
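The abstract does not spell out how the "exact consensus mechanism" works, so the following is only a toy sketch of one plausible reading: a parallel proposer drafts a block of tokens, the autoregressive view checks them, and only the prefix on which both agree is kept. The acceptance rule here is the greedy-match rule familiar from speculative decoding, which is an assumption and not necessarily the authors' method. All names (`accept_by_consensus`, `generate`, `toy_ar_next`, `toy_propose_block`) are hypothetical, and the shared KV cache is not modelled at all.

```python
from typing import Callable, List, Sequence

Token = int


def accept_by_consensus(draft: Sequence[Token], verified: Sequence[Token]) -> List[Token]:
    """Keep draft tokens until the first disagreement with the AR view.

    verified[i] is the AR view's greedy token given the context plus draft[:i],
    so len(verified) == len(draft) + 1. (Assumed rule, not taken from the paper.)
    """
    accepted: List[Token] = []
    for i, tok in enumerate(draft):
        if tok != verified[i]:
            accepted.append(verified[i])  # replace the first mismatch with the AR token
            return accepted
        accepted.append(tok)
    accepted.append(verified[len(draft)])  # whole block agreed: one extra AR token
    return accepted


def generate(prompt: Sequence[Token],
             ar_next: Callable[[List[Token]], Token],
             propose_block: Callable[[List[Token], int], List[Token]],
             block_size: int,
             max_new_tokens: int) -> List[Token]:
    """Draft-and-verify loop; the output equals plain greedy AR decoding."""
    out = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(out) < target_len:
        draft = propose_block(out, block_size)
        # A real system would verify all positions in one batched forward pass;
        # this toy version simply calls the AR function once per position.
        verified = [ar_next(out + list(draft[:i])) for i in range(len(draft) + 1)]
        out.extend(accept_by_consensus(draft, verified))
    return out[:target_len]


def greedy_reference(prompt: Sequence[Token],
                     ar_next: Callable[[List[Token]], Token],
                     max_new_tokens: int) -> List[Token]:
    """Plain sequential greedy decoding, used as the ground truth."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(ar_next(out))
    return out


def toy_ar_next(ctx: List[Token]) -> Token:
    """Stand-in for the frozen LLM: a deterministic function of the last token."""
    return (3 * ctx[-1] + 1) % 11


def toy_propose_block(ctx: List[Token], k: int) -> List[Token]:
    """Stand-in for the parallel view: mostly right, wrong at block position 2."""
    block: List[Token] = []
    cur = list(ctx)
    for i in range(k):
        tok = toy_ar_next(cur)
        if i == 2:
            tok = (tok + 1) % 11  # deliberate drafting error
        block.append(tok)
        cur.append(tok)
    return block


if __name__ == "__main__":
    fast = generate([1], toy_ar_next, toy_propose_block, block_size=4, max_new_tokens=10)
    slow = greedy_reference([1], toy_ar_next, max_new_tokens=10)
    print(fast == slow)
```

Because every kept token is either confirmed by or taken from the autoregressive function, the result matches sequential decoding even when the proposer is wrong, which is the sense in which such a scheme can be called lossless. Any speedup in practice would depend on how many drafted tokens are accepted per verification step.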


Related papers