Fugu-MT 論文翻訳(概要): Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition

論文の概要: Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition

arxiv url: http://arxiv.org/abs/2109.09161v1
Date: Sun, 19 Sep 2021 16:39:22 GMT
ステータス: 翻訳完了
システム内更新日: 2021-09-21 16:23:09.793007
Title: Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition
Title（参考訳）: wav-bert:低リソース音声認識のための協調音響・言語表現学習
Authors: Guolin Zheng, Yubei Xiao, Ke Gong, Pan Zhou, Xiaodan Liang, Liang Lin
Abstract要約: Wav-BERTは、協調的な音響および言語表現学習法である。我々は、事前訓練された音響モデル(wav2vec 2.0)と言語モデル(BERT)をエンドツーエンドのトレーニング可能なフレームワークに統合する。
参考スコア（独自算出の注目度）: 159.9312272042253
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Unifying acoustic and linguistic representation learning has become increasingly crucial to transfer the knowledge learned on the abundance of high-resource language data for low-resource speech recognition. Existing approaches simply cascade pre-trained acoustic and language models to learn the transfer from speech to text. However, how to solve the representation discrepancy of speech and text is unexplored, which hinders the utilization of acoustic and linguistic information. Moreover, previous works simply replace the embedding layer of the pre-trained language model with the acoustic features, which may cause the catastrophic forgetting problem. In this work, we introduce Wav-BERT, a cooperative acoustic and linguistic representation learning method to fuse and utilize the contextual information of speech and text. Specifically, we unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework. A Representation Aggregation Module is designed to aggregate acoustic and linguistic representation, and an Embedding Attention Module is introduced to incorporate acoustic information into BERT, which can effectively facilitate the cooperation of two pre-trained models and thus boost the representation learning. Extensive experiments show that our Wav-BERT significantly outperforms the existing approaches and achieves state-of-the-art performance on low-resource speech recognition.
Abstract（参考訳）: 音声および言語表現学習の統合は,低音源音声認識のための高音源言語データの豊富な知識を伝達するためにますます重要になっている。既存のアプローチは、音声からテキストへの転送を学ぶために、事前学習された音響モデルと言語モデルを単にカスケードする。しかし、音声とテキストの表現の相違をどう解決するかは未解明であり、音響情報や言語情報の活用を妨げる。さらに、事前学習された言語モデルの埋め込み層を音響的特徴に置き換えることで、破滅的な忘れ問題を引き起こす可能性がある。本研究では,音声とテキストの文脈情報を融合・活用するための協調音響・言語表現学習手法であるWav-BERTを紹介する。具体的には、事前訓練された音響モデル(wav2vec 2.0)と言語モデル(BERT)をエンドツーエンドのトレーニング可能なフレームワークに統合する。表現集約モジュールは音響表現と言語表現を集約するために設計され、bertに音響情報を組み込むために埋め込み注意モジュールが導入され、2つの事前学習モデルの協調を効果的に促進し、表現学習を促進することができる。広汎な実験により,我々のWav-BERTは既存の手法よりも優れ,低音源音声認識における最先端性能を実現していることがわかった。

関連論文リスト

Linguistic Knowledge Transfer Learning for Speech Enhancement [29.191204225828354]
言語知識は、言語理解において重要な役割を果たす。ほとんどの音声強調法は、雑音とクリーンな音声のマッピング関係を学習するために音響的特徴に依存している。本稿では,言語知識をSEモデルに統合するクロスモーダル・ナレッジ・トランスファー(CMKT)学習フレームワークを提案する。
論文参考訳（メタデータ） (2025-03-10T09:00:18Z)
Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings [8.660203441911554]
本稿では,音声データを利用した言語モデルの学習手法を提案する。これにより、テスト時のオーディオ処理オーバーヘッドを回避しつつ、音声書き起こしを解析するための言語モデルが改善される。本実験では, 従来の言語モデルに対して, 音声書き起こし解析のタスクにおいて一貫した改善が達成された。
論文参考訳（メタデータ） (2023-11-13T01:53:12Z)
Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
本稿では,話者ダイアリゼーションシステムにおける意味情報を活用する新しい手法を提案する。音声言語理解モジュールを導入し、話者関連意味情報を抽出する。本稿では,これらの制約を話者ダイアリゼーションパイプラインに統合する新しい枠組みを提案する。
論文参考訳（メタデータ） (2023-09-19T09:13:30Z)
Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
本研究では、2つのエンコーダを用いて音素と音声を複数モーダル空間に導入するCTAP(Contrastive Token-Acoustic Pretraining)を提案する。提案したCTAPモデルは、210k音声と音素ペアで訓練され、最小教師付きTS、VC、ASRを実現する。
論文参考訳（メタデータ） (2023-09-01T12:35:43Z)
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
VATLM (Visual-Audio-Text Language Model) を用いたクロスモーダル表現学習フレームワークを提案する。提案したVATLMは、モダリティに依存しない情報をモデル化するために、統一されたバックボーンネットワークを使用する。これら3つのモダリティを1つの共有セマンティック空間に統合するために、VATLMは統一トークンのマスク付き予測タスクで最適化される。
論文参考訳（メタデータ） (2022-11-21T09:10:10Z)
Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings [19.195728241989702]
本稿では,トップダウン語彙知識を音響単語埋め込みの訓練手順に組み込んだマルチタスク学習モデルを提案する。我々は3つの言語で実験を行い、語彙知識を取り入れることで、埋め込み空間の識別性が向上することを示した。
論文参考訳（メタデータ） (2022-09-14T13:33:04Z)
Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
本研究では、事前学習された言語モデルを用いて、文章の感情情報を学習し、音声の感情分析を行う。本稿では,言語モデルを用いた擬似ラベルに基づく半教師付き訓練戦略を提案する。
論文参考訳（メタデータ） (2021-06-11T20:15:21Z)
Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning [4.327558819000435]
音声表現を学習するための新しいテキスト音声前訓練手法を提案する。音声言語理解ベンチマークであるFluent Speech CommandsとSNIPSの実験結果から,提案手法は強いベースラインモデルよりも有意に優れていることが示された。
論文参考訳（メタデータ） (2021-04-21T05:19:13Z)
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
音声理解には、入力音響信号を解析してその言語内容を理解し、予測するモデルが必要である。大規模無注釈音声やテキストからリッチな表現を学習するために,様々な事前学習手法が提案されている。音声と言語モジュールを協調的に事前学習するための,新しい半教師付き学習フレームワークであるSPLATを提案する。
論文参考訳（メタデータ） (2020-10-05T19:29:49Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。