Fugu-MT Paper Translation (Summary): Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

Paper summary: Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

arxiv url: http://arxiv.org/abs/2311.04534v2
Date: Mon, 5 Feb 2024 02:42:57 GMT
Status: translation complete
Last updated in system: 2024-02-07 04:05:01.545943
Title: Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR
Authors: Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Yukun Ma, Hai Yu, Jiaqing Liu, Chong Zhang
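Abstract summary: Unified speech-text models have achieved remarkable performance on various speech tasks. The paper proposes modeling speech tokens autoregressively, in the same way as text. Applying the conventional cross-entropy loss to input speech tokens does not consistently improve ASR performance.
Reference score (attention score computed by this site): 58.136778669618096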
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Masking strategy for the ASR task, which ignores the dependency among speech tokens. In this paper, we propose to model speech tokens in an autoregressive way, similar to text. We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach. To address this issue, we propose a novel approach denoted Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels on speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for decoder-only Transformers in ASR tasks with different speech discretization methods. The source code can be found here: https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld
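The three training objectives contrasted in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `epsilon` smoothing weight, the choice to spread `epsilon` uniformly over the non-target tokens, and the averaging over counted positions are all assumptions made here for clarity. The abstract only states that SLD applies a KL divergence loss with smoothed labels on speech tokens.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cross_entropy(logits, target):
    """Conventional cross-entropy for a single next-token prediction."""
    return -log_softmax(logits)[target]


def smoothed_label_kl(logits, target, epsilon=0.1):
    """KL(q || p), where q is a label-smoothed one-hot target.

    Assumption: q puts 1 - epsilon on the target token and spreads
    epsilon uniformly over the remaining vocabulary entries.
    """
    vocab = len(logits)
    logp = log_softmax(logits)
    kl = 0.0
    for k in range(vocab):
        q = (1.0 - epsilon) if k == target else epsilon / (vocab - 1)
        if q > 0.0:  # terms with q = 0 contribute nothing to the KL sum
            kl += q * (math.log(q) - logp[k])
    return kl


def sequence_loss(step_logits, targets, is_speech, mode, epsilon=0.1):
    """Average loss over one [speech tokens ... text tokens] ASR sequence.

    mode = "loss_masking":  speech positions are skipped entirely.
    mode = "cross_entropy": speech positions use plain cross-entropy.
    mode = "sld":           speech positions use the smoothed-label KL term.
    Text positions always use cross-entropy.
    """
    if mode not in ("loss_masking", "cross_entropy", "sld"):
        raise ValueError(f"unknown mode: {mode}")
    total, count = 0.0, 0
    for logits, target, speech in zip(step_logits, targets, is_speech):
        if speech and mode == "loss_masking":
            continue
        if speech and mode == "sld":
            total += smoothed_label_kl(logits, target, epsilon)
        else:
            total += cross_entropy(logits, target)
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Toy sequence: two speech positions followed by one text position.
    step_logits = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 0.2, 3.0]]
    targets = [0, 1, 2]
    is_speech = [True, True, False]
    for mode in ("loss_masking", "cross_entropy", "sld"):
        print(mode, sequence_loss(step_logits, targets, is_speech, mode))
```

With `epsilon=0.0` the smoothed-label KL term reduces to plain cross-entropy, which makes the relationship between the second and third objectives easy to check. How the speech and text terms are weighted against each other in the actual method is not stated in the abstract; the linked repository is the authoritative reference.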

Related papers

DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [45.791472119671916]
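Spoken language models (SLMs) process text and speech, enabling simultaneous speech understanding and generation. DC-Spin aims to improve speech tokenization by bridging audio signals and SLM tokens. The paper proposes a chunk-wise approach that makes DC-Spin streamable without retraining or degradation.
Paper reference translation (metadata) (2024-10-31T17:43:13Z)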
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
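The paper proposes representing speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into its encoder. CosyVoice, a scalable token-based zero-shot TTS synthesizer, consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
Paper reference translation (metadata) (2024-07-07T15:16:19Z)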
TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
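TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture. The model achieves strong separation performance both with and without transcript conditioning. The authors also measure automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the model's usefulness.
Paper reference translation (metadata) (2023-08-21T01:52:01Z)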
token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text [65.04385919645395]
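token2vec is a novel pre-training framework for unpaired speech and text, based on discrete representations of speech. Experiments show that token2vec significantly outperforms various speech-only pre-training baselines, with a relative WER reduction of 17.7%.
Paper reference translation (metadata) (2022-10-30T06:38:19Z)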
SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
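The paper proposes a cross-modal speech and language model (SpeechLM) that aligns speech and text pre-training with a pre-defined unified representation. Specifically, two alternative discrete tokenizers are introduced to bridge the speech and text modalities. SpeechLM is evaluated on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
Paper reference translation (metadata) (2022-09-30T09:12:10Z)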
Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
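The paper proposes a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position dependent summarizer (PDS).
Paper reference translation (metadata) (2021-02-15T15:18:59Z)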

The related papers list is generated automatically from the titles and abstracts of papers on this site.
