Fugu-MT 論文翻訳(概要): TiDAR: Think in Diffusion, Talk in Autoregression

論文の概要: TiDAR: Think in Diffusion, Talk in Autoregression

arxiv url: http://arxiv.org/abs/2511.08923v1
Date: Thu, 13 Nov 2025 01:18:11 GMT
ステータス: 翻訳完了
システム内更新日: 2025-11-13 22:34:54.303893
Title: TiDAR: Think in Diffusion, Talk in Autoregression
Title（参考訳）: TiDAR: 拡散と自己回帰について考える
Authors: Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, Pavlo Molchanov,
Abstract要約: TiDARは、Diffusionでトークン(Thinking)をドラフトし、最終的な出力(Talking)をAutoRegressivelyにサンプリングするシーケンスレベルのハイブリッドアーキテクチャである。 TiDARはARモデルと品質ギャップを埋める最初のアーキテクチャであり、毎秒4.71倍から5.91倍のトークンを提供する。
参考スコア（独自算出の注目度）: 59.94106070312094
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or using some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks. This design exploits the free GPU compute density, achieving a strong balance between drafting and verification capacity. Moreover, TiDAR is designed to be serving-friendly (low overhead) as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at 1.5B and 8B scales. Thanks to the parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and Llada in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
Abstract（参考訳）: 拡散言語モデルは高速並列生成の可能性を保ち、一方自己回帰(AR)モデルは、言語モデリングと自然に一致する因果構造のため、通常、品質が優れている。高いスループット、高いGPU利用、そしてARレベルの品質でシナジーを達成できるだろうか? 既存の手法はこれら2つの側面を効果的にバランスさせることに失敗し、例えば、逐次起草(投機的復号)のために弱いモデルを用いてARを優先順位付けし、より低い起草効率をもたらすか、あるいは拡散のためにある種の左から右(ARに似た)復号論理を使用するか、品質劣化に悩まされ、潜在的に並列化性を失うかのどちらかである。我々は、Diffusionでトークン(Thinking)をドラフトし、最終的な出力(Talking)をAutoRegressivelyにサンプリングするシーケンスレベルのハイブリッドアーキテクチャTiDARを紹介します。この設計では、無償のGPU計算密度を利用して、ドラフトと検証能力のバランスを保っている。さらに、TiDARはスタンドアロンモデルとしてサービスフレンドリな(オーバーヘッドの低い)ように設計されている。 1.5Bスケール, 8Bスケールで, 生成タスク, 可能性タスク間でのTiDARのARモデル, 投機的復号化, 拡散変異を広範囲に評価した。並列のドラフトとサンプリングに加えて、正確なKVキャッシュサポートのおかげで、TiDARは計測スループットで投機的デコーディングを上回り、DreamやLladaといった拡散モデルを効率と品質の両方で上回っている。 TiDARはARモデルと品質ギャップを埋める最初のアーキテクチャであり、毎秒4.71倍から5.91倍のトークンを提供する。

論文の概要: TiDAR: Think in Diffusion, Talk in Autoregression

関連論文リスト