Loss Masking Is Not Needed in Decoder-only Transformer for
Discrete-token-based ASR
- URL: http://arxiv.org/abs/2311.04534v2
- Date: Mon, 5 Feb 2024 02:42:57 GMT
- Title: Loss Masking Is Not Needed in Decoder-only Transformer for
Discrete-token-based ASR
- Authors: Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong
Deng, Yukun Ma, Hai Yu, Jiaqing Liu, Chong Zhang
- Abstract summary: Unified speech-text models have achieved remarkable performance on various speech tasks.
We propose to model speech tokens in an autoregressive way, similar to text.
We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance.
- Score: 58.136778669618096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, unified speech-text models, such as SpeechGPT, VioLA, and
AudioPaLM, have achieved remarkable performance on various speech tasks. These
models discretize speech signals into tokens (speech discretization) and use a
shared vocabulary for both text and speech tokens. Then they train a single
decoder-only Transformer on a mixture of speech tasks. However, these models
rely on the Loss Masking strategy for the ASR task, which ignores the
dependency among speech tokens. In this paper, we propose to model speech
tokens in an autoregressive way, similar to text. We find that applying the
conventional cross-entropy loss on input speech tokens does not consistently
improve the ASR performance over the Loss Masking approach. To address this
issue, we propose a novel approach denoted Smoothed Label Distillation (SLD),
which applies a KL divergence loss with smoothed labels on speech tokens. Our
experiments show that SLD effectively models speech tokens and outperforms Loss
Masking for decoder-only Transformers in ASR tasks with different speech
discretization methods. The source code can be found here:
https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld
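To make the proposed loss concrete, below is a minimal PyTorch-style sketch of a Smoothed Label Distillation loss: a KL divergence against label-smoothed targets, applied only at positions whose target is a speech token (under Loss Masking, those positions would simply be dropped from the loss). The function name, smoothing value, and masking convention are illustrative assumptions rather than the authors' exact implementation; see the linked repository for the official code.

# Sketch of an SLD-style loss (assumptions noted above), PyTorch.
import torch
import torch.nn.functional as F

def sld_loss(logits, targets, speech_token_mask, smoothing=0.1):
    """KL divergence between model predictions and label-smoothed targets,
    computed only at speech-token positions.

    logits:            (batch, seq_len, vocab_size) decoder outputs
    targets:           (batch, seq_len) next-token ids over the shared text+speech vocabulary (int64)
    speech_token_mask: (batch, seq_len) True where the target is a speech token
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Build smoothed targets: (1 - eps) on the gold token,
    # eps spread uniformly over the remaining vocabulary entries.
    with torch.no_grad():
        smooth = torch.full_like(log_probs, smoothing / (vocab_size - 1))
        smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - smoothing)

    # Per-position KL(smoothed targets || model), summed over the vocabulary.
    kl = F.kl_div(log_probs, smooth, reduction="none").sum(-1)

    # Average over speech-token positions only; text positions keep the
    # standard cross-entropy loss elsewhere in the training loop.
    mask = speech_token_mask.float()
    return (kl * mask).sum() / mask.sum().clamp(min=1.0)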
Related papers
- DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [45.791472119671916]
Spoken language models (SLMs) process text and speech, enabling simultaneous speech understanding and generation.
DC-Spin aims to improve speech tokenization by bridging audio signals and SLM tokens.
We propose a chunk-wise approach to enable streamable DC-Spin without retraining or degradation.
arXiv Detail & Related papers (2024-10-31T17:43:13Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z) - TokenSplit: Using Discrete Speech Representations for Direct, Refined,
and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent performance in terms of separation, both with or without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z) - token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired
Speech and Text [65.04385919645395]
token2vec is a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech.
Experiments show that token2vec is significantly superior to various speech-only pre-training baselines, with up to a 17.7% relative WER reduction.
arXiv Detail & Related papers (2022-10-30T06:38:19Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)