Fugu-MT paper translation (abstract): Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data

Paper abstract: Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data

arxiv url: http://arxiv.org/abs/2603.08249v1
Date: Mon, 09 Mar 2026 11:22:24 GMT
Status: translation complete
Last updated in system: 2026-03-10 15:13:15.829197
Title: Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data
Authors: Pol Buitrago, Pol Gàlvez, Oriol Pareras, Javier Hernando
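Abstract summary: This paper proposes a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. The authors synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. The model achieves near state-of-the-art performance with fewer parameters and less training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise.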
Reference score (independently computed attention score): 4.911970211082446
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with much fewer parameters and training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.
