Attention-based Contextual Language Model Adaptation for Speech
Recognition
- URL: http://arxiv.org/abs/2106.01451v1
- Date: Wed, 2 Jun 2021 20:19:57 GMT
- Title: Attention-based Contextual Language Model Adaptation for Speech
Recognition
- Authors: Richard Diehl Martinez, Scott Novotney, Ivan Bulyko, Ariya Rastrow,
Andreas Stolcke, Ankur Gandhe
- Abstract summary: We introduce an attention mechanism for training neural speech recognition language models on text and non-linguistic contextual data.
Our method reduces perplexity by 7.0% relative over a standard LM that does not incorporate contextual information.
- Score: 13.516224963932858
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Language modeling (LM) for automatic speech recognition (ASR) does not
usually incorporate utterance level contextual information. For some domains
like voice assistants, however, additional context, such as the time at which
an utterance was spoken, provides a rich input signal. We introduce an
attention mechanism for training neural speech recognition language models on
both text and non-linguistic contextual data. When applied to a large
de-identified dataset of utterances collected by a popular voice assistant
platform, our method reduces perplexity by 7.0% relative over a standard LM
that does not incorporate contextual information. When evaluated on utterances
extracted from the long tail of the dataset, our method improves perplexity by
9.0% relative over a standard LM and by over 2.8% relative when compared to a
state-of-the-art model for contextual LM.
Related papers
- End-to-End Speech Recognition Contextualization with Large Language
Models [25.198480789044346]
We introduce a novel method for contextualizing speech recognition models incorporating Large Language Models (LLMs)
We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion.
Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided.
arXiv Detail & Related papers (2023-09-19T20:28:57Z) - Learning Speech Representation From Contrastive Token-Acoustic
Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - Is Attention always needed? A Case Study on Language Identification from
Speech [1.162918464251504]
The present study introduces convolutional recurrent neural network (CRNN) based LID.
CRNN based LID is designed to operate on the Mel-frequency Cepstral Coefficient (MFCC) characteristics of audio samples.
The LID model exhibits high-performance levels ranging from 97% to 100% for languages that are linguistically similar.
arXiv Detail & Related papers (2021-10-05T16:38:57Z) - Private Language Model Adaptation for Speech Recognition [15.726921748859393]
Speech model adaptation is crucial to handle the discrepancy between server-side proxy training data and actual data received on users' local devices.
We introduce an efficient approach on continuously adapting neural network language models (NNLMs) on private devices with applications on automatic speech recognition.
arXiv Detail & Related papers (2021-09-28T00:15:43Z) - Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z) - UniSpeech: Unified Speech Representation Learning with Labeled and
Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z) - Meta-Transfer Learning for Code-Switched Speech Recognition [72.84247387728999]
We propose a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting.
Our model learns to recognize individual languages, and transfer them so as to better recognize mixed-language speech by conditioning the optimization on the code-switching data.
arXiv Detail & Related papers (2020-04-29T14:27:19Z) - Towards Relevance and Sequence Modeling in Language Recognition [39.547398348702025]
We propose a neural network framework utilizing short-sequence information in language recognition.
A new model is proposed for incorporating relevance in language recognition, where parts of speech data are weighted more based on their relevance for the language recognition task.
Experiments are performed using the language recognition task in NIST LRE 2017 Challenge using clean, noisy and multi-speaker speech data.
arXiv Detail & Related papers (2020-04-02T18:31:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.