Fugu-MT Paper Translation (Abstract): Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation

Paper overview: Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation

arxiv url: http://arxiv.org/abs/2510.08078v1
Date: Thu, 09 Oct 2025 11:08:07 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:15.029238
Title: Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation
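Title (reference translation): Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation
Authors: Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang
Abstract summary: Video-to-Audio generation has made remarkable progress in automatically synthesizing sound for video. Models often generate acoustic events that have no corresponding visual source; we term this phenomenon Insertion Hallucination (IH) and regard it as a systemic risk driven by dataset biases. We introduce two novel metrics to quantify the prevalence and severity of this issue. We propose Posterior Feature Correction, a novel training-free inference-time method that mitigates IH.
Reference score (independently computed attention score): 29.443084496227026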
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction, a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
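The abstract names a majority-voting ensemble and two metrics but gives no formulas, so the following is only a minimal sketch of how they could be computed. It assumes each detector emits one boolean flag per fixed-length frame, that a frame counts as hallucinated when strictly more than half of the detectors flag it, and that IH@dur pools frames across all videos rather than averaging per video. The function names `majority_vote`, `ih_at_vid`, and `ih_at_dur` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def majority_vote(detector_flags: Sequence[Sequence[bool]]) -> List[bool]:
    """Combine per-frame flags from several detectors into one mask.

    detector_flags[d][t] is True when detector d flags frame t.
    Assumption: a frame is hallucinated only if strictly more than
    half of the detectors flag it.
    """
    n_detectors = len(detector_flags)
    n_frames = len(detector_flags[0])
    return [
        2 * sum(flags[t] for flags in detector_flags) > n_detectors
        for t in range(n_frames)
    ]


def ih_at_vid(masks: Sequence[Sequence[bool]]) -> float:
    """Fraction of videos containing at least one hallucinated frame."""
    return sum(any(mask) for mask in masks) / len(masks)


def ih_at_dur(masks: Sequence[Sequence[bool]]) -> float:
    """Fraction of hallucinated frames over all frames.

    Assumption: frames are pooled across videos; a per-video average
    would be an equally plausible reading of the abstract.
    """
    total_frames = sum(len(mask) for mask in masks)
    return sum(sum(mask) for mask in masks) / total_frames
```

With equal-length frames, the pooled frame fraction equals the fraction of hallucinated duration; a per-video average would need only a small change to `ih_at_dur`.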
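Abstract (reference translation): Video-to-Audio generation has made remarkable progress in automatically synthesizing audio for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and regard it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that goes completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that uses a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction, a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average without degrading conventional metrics. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.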
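The two-pass process described in the abstract can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation. The `generate` and `detect` callables are hypothetical stand-ins for a V2A model and a hallucination detector. The detector is assumed to return one flag per video time step, and masking is assumed to mean replacing a feature vector with a constant value. The abstract does not specify the actual masking operation or how audio and video timelines are aligned.

```python
from typing import Callable, List, Sequence

Features = List[List[float]]  # one feature vector per video time step


def posterior_feature_correction(
    video_features: Features,
    generate: Callable[[Features], Sequence[float]],
    detect: Callable[[Sequence[float]], Sequence[bool]],
    mask_value: float = 0.0,
) -> Sequence[float]:
    """Two-pass generation: detect hallucinated steps, mask, regenerate.

    `generate` and `detect` are hypothetical placeholders supplied by
    the caller; no training or model update happens here.
    """
    # Pass 1: generate audio from the unmodified video features.
    initial_audio = generate(video_features)

    # Flag time steps whose generated audio is judged hallucinated.
    flags = detect(initial_audio)
    if not any(flags):
        return initial_audio  # nothing to correct

    # Assumption: masking replaces the flagged feature vectors with a
    # constant; unflagged vectors are copied unchanged.
    corrected = [
        [mask_value] * len(vec) if flagged else list(vec)
        for vec, flagged in zip(video_features, flags)
    ]

    # Pass 2: regenerate audio from the masked features.
    return generate(corrected)
```

Because the correction only changes the inputs to a second inference pass, the sketch stays training-free, matching the abstract's description; the cost is roughly one extra generation per corrected video.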
