Fugu-MT paper translation (abstract): Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition

Paper overview: Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition

arxiv url: http://arxiv.org/abs/2206.07327v1
Date: Wed, 15 Jun 2022 07:20:28 GMT
Status: Translation complete
Last updated in system: 2022-06-16 14:30:49.542496
Title: Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition
Title (reference translation): Cross-domain and cross-lingual ultrasound tongue imaging features for elderly and dysarthric speech recognition
Authors: Shujie Hu, Xurong Xie, Mengzhe Geng, Mingyu Cui, Jiajun Deng, Tianzi Wang, Xunying Liu, Helen Meng
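Abstract summary: Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into speech recognition systems. This paper presents a cross-domain and cross-lingual A2A inversion approach that uses the parallel audio, visual and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus for A2A model pre-training. In experiments on three tasks, systems incorporating the generated articulatory features consistently outperformed the baseline hybrid TDNN and Conformer-based end-to-end systems.
Reference score (independently computed attention score): 57.63552541911143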
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems designed for normal speech. Their practical application to atypical task domains such as elderly and disordered speech across languages is often limited by the difficulty in collecting such specialist data from target speakers. This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio, visual and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training before being cross-domain and cross-lingual adapted to three datasets across two languages: the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora; and the English TORGO dysarthric speech data, to produce UTI based articulatory features. Experiments conducted on three tasks suggested incorporating the generated articulatory features consistently outperformed the baseline hybrid TDNN and Conformer based end-to-end systems constructed using acoustic features only by statistically significant word error rate or character error rate reductions up to 2.64%, 1.92% and 1.21% absolute (8.17%, 7.89% and 13.28% relative) after data augmentation and speaker adaptation were applied.
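Abstract (reference translation): Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems designed for normal speech. Their practical application to atypical task domains, such as elderly and disordered speech across languages, is often limited by the difficulty of collecting such specialist data from target speakers. This paper presents a cross-domain and cross-lingual A2A inversion approach that pre-trains the A2A model on the parallel audio, visual and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus, then adapts it across domains and languages to three datasets in two languages: the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora, and the English TORGO dysarthric speech data, to produce UTI-based articulatory features. In experiments on the three tasks, systems incorporating the generated articulatory features consistently outperformed the baseline hybrid TDNN and Conformer-based end-to-end systems built using acoustic features only, with statistically significant word error rate or character error rate reductions of up to 2.64%, 1.92% and 1.21% absolute (8.17%, 7.89% and 13.28% relative) after data augmentation and speaker adaptation were applied.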
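
The abstract reports each error rate reduction both in absolute and in relative terms. A minimal Python sketch of how the two relate is given here, under the assumption (not stated in the abstract) that each absolute/relative pair describes the same system comparison, so that relative = absolute / baseline. The function name `implied_baseline` is hypothetical, and the computed baselines are back-of-the-envelope estimates, not figures reported by the authors.

```python
def implied_baseline(absolute_reduction: float, relative_reduction: float) -> float:
    """Return the baseline error rate (in %) implied by a reduction pair.

    Both arguments are percentages. Assumes relative = absolute / baseline,
    i.e. that the two numbers refer to the same baseline system.
    """
    return absolute_reduction / relative_reduction * 100.0


# (absolute %, relative %) pairs in the order they appear in the abstract.
pairs = [(2.64, 8.17), (1.92, 7.89), (1.21, 13.28)]

for absolute, relative in pairs:
    baseline = implied_baseline(absolute, relative)
    improved = baseline - absolute
    print(f"absolute {absolute:.2f}%, relative {relative:.2f}% "
          f"-> implied baseline {baseline:.2f}%, improved {improved:.2f}%")
```

Under that assumption, the three pairs imply baseline error rates of roughly 32.3%, 24.3% and 9.1%, which shows why the smallest absolute reduction (1.21%) corresponds to the largest relative one (13.28%).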

Related papers