Fugu-MT Paper Translation (Abstract): Multilingual Visual Speech Recognition with a Single Model by Learning with Discrete Visual Speech Units

Paper summary: Multilingual Visual Speech Recognition with a Single Model by Learning with Discrete Visual Speech Units

arxiv url: http://arxiv.org/abs/2401.09802v1
Date: Thu, 18 Jan 2024 08:46:02 GMT
Status: Translation complete
Last updated in system: 2024-01-19 17:20:22.126612
Title: Multilingual Visual Speech Recognition with a Single Model by Learning with Discrete Visual Speech Units
Authors: Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Se Jin Park, Yong Man Ro
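Abstract summary: This paper explores sentence-level multilingual visual speech recognition with a single model. Motivated by the recent success of audio speech units, the proposed visual speech units are obtained by discretizing the visual speech features extracted from a self-supervised visual speech model. With a single trained model, the authors achieve performance comparable to previous language-specific VSR models and set a new state of the art for multilingual VSR.
Reference score (attention score computed independently by this site): 59.84564095008798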
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper explores sentence-level Multilingual Visual Speech Recognition with a single model for the first time. As the massive multilingual modeling of visual data requires huge computational costs, we propose a novel strategy, processing with visual speech units. Motivated by the recent success of the audio speech unit, the proposed visual speech unit is obtained by discretizing the visual speech features extracted from the self-supervised visual speech model. To correctly capture multilingual visual speech, we first train the self-supervised visual speech model on 5,512 hours of multilingual audio-visual data. Through analysis, we verify that the visual speech units mainly contain viseme information while suppressing non-linguistic information. By using the visual speech units as the inputs of our system, we pre-train the model to predict corresponding text outputs on massive multilingual data constructed by merging several VSR databases. As both the inputs and outputs are discrete, we can greatly improve the training efficiency compared to the standard VSR training. Specifically, the input data size is reduced to 0.016% of the original video inputs. In order to complement the insufficient visual information in speech recognition, we apply curriculum learning where the inputs of the system begin with audio-visual speech units and gradually change to visual speech units. After pre-training, the model is finetuned on continuous features. We set new state-of-the-art multilingual VSR performances by achieving comparable performances to the previous language-specific VSR models, with a single trained model.
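
The two mechanisms the abstract describes, discretizing continuous features into units and a curriculum that moves from audio-visual units to visual units, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the abstract does not say how features are discretized (nearest-centroid assignment against a fixed codebook is assumed), nor how the curriculum is scheduled (a linear schedule is assumed), and the function names `discretize`, `audio_visual_probability`, and `pick_input_units` are hypothetical, not taken from the paper.

```python
import random
from typing import List, Sequence


def nearest_unit(frame: Sequence[float],
                 codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook centroid closest to one feature frame.

    Assumption: units come from nearest-centroid lookup (squared Euclidean
    distance); the abstract only says the features are "discretized".
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))


def discretize(features: Sequence[Sequence[float]],
               codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map a sequence of continuous feature frames to a sequence of unit IDs."""
    return [nearest_unit(frame, codebook) for frame in features]


def audio_visual_probability(step: int, total_steps: int) -> float:
    """Probability of feeding audio-visual units at a given training step.

    Assumption: a linear decay from 1.0 at step 0 to 0.0 at total_steps;
    the abstract only says the inputs "gradually change".
    """
    return max(0.0, 1.0 - step / total_steps)


def pick_input_units(av_units: List[int], v_units: List[int],
                     step: int, total_steps: int,
                     rng: random.Random) -> List[int]:
    """Choose audio-visual or visual-only units for one training sample."""
    if rng.random() < audio_visual_probability(step, total_steps):
        return av_units
    return v_units


if __name__ == "__main__":
    # Toy 2-D features and a 3-entry codebook (made-up values).
    codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
    features = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.8], [0.1, 0.1]]
    print(discretize(features, codebook))  # [0, 1, 2, 0]
```

Because each frame collapses to a single integer, the model input becomes a short sequence of IDs rather than raw video, which is the kind of size reduction the abstract reports. Early in training the sketch always returns the audio-visual units, and after `total_steps` it always returns the visual-only units.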
