Fugu-MT Paper Translation (Abstract): Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

Paper overview: Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

arxiv url: http://arxiv.org/abs/2604.25819v1
Date: Tue, 28 Apr 2026 16:28:05 GMT
Status: Translation complete
System update date: 2026-04-29 16:49:17.949491
Title: Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
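Title (reference translation): Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Authors: Yupeng Zhou, Lianghua Huang, Zhifan Wu, Jiabao Wang, Yupeng Shi, Biao Jiang, Daquan Zhou, Yu Liu, Ming-Ming Cheng, Qibin Hou
Abstract summary: Mutual Forcing is a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. The proposed method addresses two challenges: joint audio-video modeling and fast autoregressive generation.
Reference score (independently computed attention score): 86.69465782855086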
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency; because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at https://mutualforcing.github.io.
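Abstract (reference translation): In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. The proposed method addresses two challenges: joint audio-video modeling and fast autoregressive generation. We first train uni-modal generators and then combine them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, rather than following existing streaming distillation pipelines. Our answer is Mutual Forcing, which builds on a native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improving training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and can improve directly from real paired data rather than a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at https://mutualforcing.github.io.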
