Fugu-MT Paper Translation (Abstract): VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

Paper overview: VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

arxiv url: http://arxiv.org/abs/2601.17868v1
Date: Sun, 25 Jan 2026 15:02:01 GMT
Status: Translation complete
Last updated in system: 2026-01-27 15:23:08.497098
Title: VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding
Title (reference translation): VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding
Authors: Zhihao He, Tieyuan Chen, Kangyu Wang, Ziran Qin, Yang Shao, Chaofan Gan, Shijie Li, Zuxuan Wu, Weiyao Lin
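Abstract summary: VidLaDA is a video LLM based on a diffusion language model. It introduces MARS-Cache to address the inference bottleneck of diffusion decoding over massive numbers of video tokens. Experiments show that VidLaDA outperforms diffusion baselines and rivals state-of-the-art autoregressive models.
Reference score (independently computed attention metric): 52.69880888587866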
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Standard Autoregressive Video LLMs inevitably suffer from causal masking biases that hinder global spatiotemporal modeling, leading to suboptimal understanding efficiency. We propose VidLaDA, a Video LLM based on Diffusion Language Model utilizing bidirectional attention to capture bidirectional dependencies. To further tackle the inference bottleneck of diffusion decoding on massive video tokens, we introduce MARS-Cache. This framework accelerates inference by combining asynchronous visual cache refreshing with frame-wise chunk attention, effectively pruning redundancy while preserving global connectivity via anchor tokens. Extensive experiments show VidLaDA outperforms diffusion baselines and rivals state-of-the-art autoregressive models (e.g., Qwen2.5-VL and LLaVA-Video), with MARS-Cache delivering over 12x speedup without compromising reasoning accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.
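Abstract (reference translation): Standard autoregressive video LLMs inevitably suffer from causal masking biases that hinder global spatiotemporal modeling, which leads to suboptimal understanding efficiency. We propose VidLaDA, a video LLM based on a diffusion language model that uses bidirectional attention to capture bidirectional dependencies. To further address the inference bottleneck of diffusion decoding over massive numbers of video tokens, we introduce MARS-Cache. This framework accelerates inference by combining asynchronous visual cache refreshing with frame-wise chunk attention, effectively pruning redundancy while preserving global connectivity through anchor tokens. Extensive experiments show that VidLaDA outperforms diffusion baselines and rivals state-of-the-art autoregressive models (e.g., Qwen2.5-VL and LLaVA-Video), with MARS-Cache delivering a speedup of over 12x without compromising reasoning accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.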
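
Illustrative note: the abstract contrasts causal masking with bidirectional attention and mentions frame-wise chunk attention that stays globally connected through anchor tokens. The sketch below is one possible reading of those ideas as boolean attention masks. It is not the authors' implementation. The function names, the choice of the first token(s) of each frame as anchors, and the rule that every query may see every anchor are all assumptions made for illustration; the abstract does not specify them.

```python
# Illustrative sketch only. The mask layout, anchor placement, and every name
# below are assumptions; none of this is taken from the VidLaDA paper or code.

def causal_mask(n):
    """Autoregressive mask: query i may attend only to keys j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Full mask: every query may attend to every key."""
    return [[True] * n for _ in range(n)]


def chunked_mask_with_anchors(num_frames, tokens_per_frame, anchors_per_frame=1):
    """Frame-wise chunk attention with anchor tokens (assumed layout).

    A query attends bidirectionally to all keys in its own frame. In
    addition, the first `anchors_per_frame` tokens of every frame are
    treated as anchors that any query may attend to, so information can
    still travel between frames.
    """
    n = num_frames * tokens_per_frame
    frame = [i // tokens_per_frame for i in range(n)]
    is_anchor = [i % tokens_per_frame < anchors_per_frame for i in range(n)]
    return [
        [frame[i] == frame[j] or is_anchor[j] for j in range(n)]
        for i in range(n)
    ]


def density(mask):
    """Fraction of (query, key) pairs that the mask allows."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


if __name__ == "__main__":
    chunked = chunked_mask_with_anchors(num_frames=4, tokens_per_frame=4)
    print("causal:       ", density(causal_mask(16)))
    print("bidirectional:", density(bidirectional_mask(16)))
    print("chunk+anchors:", density(chunked))
```

In this toy setting (4 frames of 4 tokens, one anchor per frame), each query sees its own 4 frame tokens plus 3 anchors from other frames, so the chunked mask allows 7 of 16 keys per query. The gap to the full bidirectional mask grows with the number of frames, which hints at why a scheme of this kind could reduce attention cost on long videos. The actual savings reported in the paper depend on details not given in the abstract.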
