Fugu-MT Paper Translation (Abstract): Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Paper summary: Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

arxiv url: http://arxiv.org/abs/2602.20981v2
Date: Wed, 25 Feb 2026 02:22:46 GMT
Status: Translation complete
Last updated in system: 2026-02-26 13:37:25.579578
Title: Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
Title (reference translation): Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
Authors: Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji
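Abstract summary: The paper addresses the scaling challenge in multimodal-to-audio generation and examines whether models trained on short instances can generalize to longer instances at test time. The proposed method integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Experiments show that the proposed method achieves strong results on long-video-to-audio benchmarks, outperforming prior work on video-to-audio tasks.
Reference score (independently computed attention score): 42.75068463173552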
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones at test time. To this end, we present multimodal hierarchical networks, called MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. The proposed method significantly improves long audio generation, up to durations of more than 5 minutes. We also show that training on short clips and testing on long ones is possible in video-to-audio generation, without training on the longer durations. In our experiments, the proposed method achieves remarkable results on long-video-to-audio benchmarks, outperforming prior work on video-to-audio tasks. Moreover, we demonstrate that our model can generate more than 5 minutes of audio, whereas prior video-to-audio methods fall short at long durations.
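Abstract (reference translation): Scaling multimodal alignment between video and audio is difficult, particularly because of limited data and the mismatch between text descriptions and frame-level video information. In this work, we address the scaling challenge in multimodal-to-audio generation and examine whether models trained on short instances can generalize to longer instances at test time. To address this challenge, we propose MMHNet, a multimodal hierarchical network. The proposed method integrates a hierarchical method and non-causal Mamba to support long-form audio generation. It significantly improves long audio generation of more than 5 minutes. We also show that, in video-to-audio generation tasks, training on short durations and testing on long durations is possible without training on the longer durations. Experiments show that the proposed method achieves strong results on long-video-to-audio benchmarks, outperforming prior work on video-to-audio tasks. Furthermore, we show that our model can generate more than 5 minutes of audio, whereas prior video-to-audio methods fall short at long durations.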
