Fugu-MT Paper Translation (Abstract): DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning

Paper overview: DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning

arxiv url: http://arxiv.org/abs/2604.08084v1
Date: Thu, 09 Apr 2026 10:56:49 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:05.872398
Title: DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning
Title (reference translation): DiffVC: A non-autoregressive framework based on a diffusion model for video captioning
Authors: Junbo Wang, Liangyu Fu, Yuke Li, Yining Zhu, Ya Jing, Xuecheng Wu, Jiangbin Zheng
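Abstract summary: The paper proposes DiffVC, a non-autoregressive framework based on a diffusion model for video captioning. The authors' discriminative conditional diffusion model can generate high-quality textual descriptions. Experiments on MSVD, MSR-VTT, and VATEX show that the method outperforms previous non-autoregressive methods.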
Reference score (independently computed attention score): 20.00706494207555
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current video captioning methods usually use an encoder-decoder structure to generate text autoregressively. However, autoregressive methods have inherent limitations such as slow generation speed and large cumulative error. Furthermore, the few non-autoregressive counterparts suffer from deficiencies in generation quality due to the lack of sufficient multimodal interaction modeling. Therefore, we propose a non-autoregressive framework based on Diffusion model for Video Captioning (DiffVC) to address these issues. Its parallel decoding can effectively solve the problems of generation speed and cumulative error. At the same time, our proposed discriminative conditional Diffusion Model can generate higher-quality textual descriptions. Specifically, we first encode the video into a visual representation. During training, Gaussian noise is added to the textual representation of the ground-truth caption. Then, a new textual representation is generated via the discriminative denoiser with the visual representation as a conditional constraint. Finally, we input the new textual representation into a non-autoregressive language model to generate captions. During inference, we directly sample noise from the Gaussian distribution for generation. Experiments on MSVD, MSR-VTT, and VATEX show that our method can outperform previous non-autoregressive methods and achieve comparable performance to autoregressive methods, e.g., it achieved a maximum improvement of 9.9 on the CIDEr and improvement of 2.6 on the B@4, while having faster generation speed. The source code will be available soon.
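Abstract (reference translation): Current video captioning methods usually use an encoder-decoder structure to generate text autoregressively. However, autoregressive methods have inherent limitations, such as slow generation and large cumulative error. In addition, the few non-autoregressive models produce lower-quality output because they do not model multimodal interaction sufficiently. This paper therefore proposes a non-autoregressive framework based on a diffusion model for video captioning (DiffVC). Its parallel decoding effectively addresses both generation speed and cumulative error. At the same time, the proposed discriminative conditional diffusion model can generate higher-quality textual descriptions. Specifically, the video is first encoded into a visual representation. During training, Gaussian noise is added to the textual representation of the ground-truth caption. A new textual representation is then generated by the discriminative denoiser, with the visual representation as a conditional constraint. Finally, the new textual representation is fed into a non-autoregressive language model to generate captions. During inference, noise is sampled directly from a Gaussian distribution for generation. Experiments on MSVD, MSR-VTT, and VATEX show that the method outperforms previous non-autoregressive methods and achieves performance comparable to autoregressive methods, with a maximum improvement of 9.9 on CIDEr and 2.6 on B@4, while also generating faster. The source code is to be released soon.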
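The abstract describes two noise-related steps: Gaussian noise is added to the caption's textual representation during training, and generation starts from pure Gaussian noise during inference. The following is a minimal sketch of those two steps only, not the authors' implementation. It assumes the standard DDPM-style forward process x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps with a linear beta schedule; the abstract specifies neither. All function names, the schedule values, and the toy embedding are hypothetical, and the discriminative denoiser and the non-autoregressive language model are omitted.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    The schedule and its endpoints are assumptions for illustration;
    the abstract does not say which schedule DiffVC uses.
    """
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_gaussian_noise(x0, alpha_bar, rng):
    """Training-time step: noise a clean textual representation x0."""
    keep = math.sqrt(alpha_bar)          # weight on the clean signal
    spread = math.sqrt(1.0 - alpha_bar)  # weight on the Gaussian noise
    return [keep * v + spread * rng.gauss(0.0, 1.0) for v in x0]


def sample_inference_start(dim, rng):
    """Inference-time start: pure Gaussian noise, no ground-truth caption."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
alpha_bars = linear_alpha_bars(1000)

# Toy stand-in for the textual representation of a ground-truth caption.
x0 = [0.5, -1.0, 0.25, 2.0]

x_early = add_gaussian_noise(x0, alpha_bars[0], rng)   # lightly noised
x_late = add_gaussian_noise(x0, alpha_bars[-1], rng)   # almost pure noise
x_T = sample_inference_start(len(x0), rng)             # inference input
```

In a full pipeline, a denoiser conditioned on the visual representation would map `x_late` or `x_T` back toward a clean textual representation before caption decoding; that component is outside the scope of this sketch.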

Related papers