Context-Aware Transformer Transducer for Speech Recognition
- URL: http://arxiv.org/abs/2111.03250v1
- Date: Fri, 5 Nov 2021 04:14:35 GMT
- Title: Context-Aware Transformer Transducer for Speech Recognition
- Authors: Feng-Ju Chang, Jing Liu, Martin Radfar, Athanasios Mouchtaris,
Maurizio Omologo, Ariya Rastrow, Siegfried Kunzmann
- Abstract summary: We present a novel context-aware transformer transducer (CATT) network that improves the state-of-the-art transformer-based ASR system by taking advantage of such contextual signals.
We show that CATT, using a BERT-based context encoder, improves the word error rate of the baseline transformer transducer and outperforms an existing deep contextual model by 24.2% and 19.4%, respectively.
- Score: 21.916660252023707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems often have
difficulty recognizing uncommon words that appear infrequently in the training
data. One promising method to improve recognition accuracy on such rare
words is to latch onto personalized/contextual information at inference. In
this work, we present a novel context-aware transformer transducer (CATT)
network that improves the state-of-the-art transformer-based ASR system by
taking advantage of such contextual signals. Specifically, we propose a
multi-head attention-based context-biasing network, which is jointly trained
with the rest of the ASR sub-networks. We explore different techniques to
encode contextual data and to create the final attention context vectors. We
also leverage both BLSTM and pretrained BERT based models to encode contextual
data and guide the network training. Using an in-house far-field dataset, we
show that CATT, using a BERT-based context encoder, improves the word error
rate of the baseline transformer transducer and outperforms an existing deep
contextual model by 24.2% and 19.4%, respectively.
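The biasing mechanism described in the abstract is, at its core, multi-head cross-attention between the ASR encoder streams and a set of encoded context phrases. A minimal PyTorch sketch of that idea follows; the module name, dimensions, and residual fusion are illustrative assumptions, not the paper's exact design.
```python
# Sketch of multi-head attention-based context biasing in the spirit of CATT.
# Assumptions (not from the paper): sizes, naming, and the residual fusion.
import torch
import torch.nn as nn

class ContextBiasing(nn.Module):
    """Cross-attends a stream of ASR embeddings over encoded context phrases."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, stream, context):
        # stream:  (B, T, d_model) audio-encoder (or label-encoder) outputs
        # context: (B, K, d_model) one vector per context phrase, e.g. from a
        #          BLSTM or a pretrained BERT context encoder
        biased, _ = self.attn(query=stream, key=context, value=context)
        return self.norm(stream + biased)  # residual keeps acoustic info intact

# Toy usage: 2 utterances, 50 frames, 10 context phrases each.
biasing = ContextBiasing()
audio = torch.randn(2, 50, 256)
ctx = torch.randn(2, 10, 256)
print(biasing(audio, ctx).shape)  # torch.Size([2, 50, 256])
```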
Related papers
- Large Generative Model-assisted Talking-face Semantic Communication System [55.42631520122753]
This study introduces a Large Generative Model-assisted Talking-face Semantic Communication (LGM-TSC) system.
A Generative Semantic Extractor (GSE) at the transmitter converts semantically sparse talking-face videos into texts with high information density.
A Private Knowledge Base (KB) built on a Large Language Model (LLM) performs semantic disambiguation and correction.
A Generative Semantic Reconstructor (GSR) utilizes the BERT-VITS2 and SadTalker models to transform text back into a high-QoE talking-face video.
arXiv Detail & Related papers (2024-11-06T12:45:46Z)
- Transformer-based Joint Source Channel Coding for Textual Semantic Communication [23.431590618978948]
The Space-Air-Ground-Sea integrated network calls for more robust and secure transmission techniques against jamming.
We propose a textual semantic transmission framework for robust transmission, which utilizes advanced natural language processing techniques to model and encode sentences.
arXiv Detail & Related papers (2023-07-23T08:42:05Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of the Transformer-Transducer (T-T), a streaming model commonly used in industry.
We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model, either explicitly via Text-To-Speech (TTS) conversion or implicitly by tying the speech and text latent spaces.
Experimental results on a T-T model trained with 1,800 hours of real Mandarin-English code-switched speech show that injecting the generated code-switching text significantly boosts T-T performance; a toy generation sketch follows this entry.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
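As a rough illustration of the generation step, a lexicon-substitution toy is sketched below; it is a generic stand-in, not the paper's actual strategy, and the lexicon and sentence are invented.
```python
# Toy Mandarin-English code-switching by bilingual-lexicon substitution.
import random

LEXICON = {"会议": "meeting", "邮件": "email", "项目": "project"}  # zh -> en

def code_switch(sentence: str, p: float = 0.5) -> str:
    """Swap each lexicon entry into English with probability p."""
    for zh, en in LEXICON.items():
        if zh in sentence and random.random() < p:
            sentence = sentence.replace(zh, f" {en} ")
    return " ".join(sentence.split())

# With p=1.0 every lexicon word is switched:
print(code_switch("明天的会议改到下午，记得发邮件通知项目组。", p=1.0))
```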
- Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator [17.44686265224974]
We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both.
We demonstrate that the proposed training method significantly improves ASR accuracy compared to a system trained on transcribed speech only; a sketch of the batch-mixing idea follows this entry.
arXiv Detail & Related papers (2023-02-27T18:47:55Z)
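The core training trick can be pictured as routing text-only batches through a text-to-mel generator so both data types feed the same ASR encoder. The PyTorch sketch below uses placeholder modules; the real system's generator and encoder are far more elaborate.
```python
# Mixing transcribed-speech and text-only batches through one ASR encoder.
# All modules here are placeholders standing in for the integrated
# text-to-mel generator and the actual ASR encoder.
import torch
import torch.nn as nn

vocab, n_mels, d = 100, 80, 256
text_to_mel = nn.Sequential(nn.Embedding(vocab, d), nn.Linear(d, n_mels))
asr_encoder = nn.Linear(n_mels, d)

def encode(batch):
    if "mel" in batch:                    # transcribed speech: real features
        mel = batch["mel"]
    else:                                 # text-only: synthesize mel features
        mel = text_to_mel(batch["tokens"])
    return asr_encoder(mel)

speech_batch = {"mel": torch.randn(4, 120, n_mels)}
text_batch = {"tokens": torch.randint(0, vocab, (4, 30))}
print(encode(speech_batch).shape, encode(text_batch).shape)
```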
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms; a toy data-pairing example follows this entry.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
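A toy example of the data side of this idea: each target utterance is paired with its neighboring utterances as extra context. Window sizes and the pairing scheme below are assumptions for illustration.
```python
# Pair each utterance with its previous/future neighbors as context.
def contextual_examples(utterances, left=1, right=1):
    examples = []
    for i, target in enumerate(utterances):
        prev = utterances[max(0, i - left):i]
        futr = utterances[i + 1:i + 1 + right]
        examples.append({"context": prev + futr, "target": target})
    return examples

dialog = ["turn on the lights", "which room", "the kitchen please"]
for ex in contextual_examples(dialog):
    print(ex)
```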
- Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism [20.782319059183173]
We propose to explicitly model the inter-sentential information in a Transformer based end-to-end architecture for conversational speech recognition.
We show the effectiveness of the proposed method on several open-source dialogue corpora, where it consistently improves over utterance-level Transformer-based ASR models; a minimal attention sketch follows this entry.
arXiv Detail & Related papers (2022-07-02T17:17:47Z)
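One plausible reading of inter-sentential attention, sketched in PyTorch: the current utterance's encoder states attend over cached states of the preceding utterance. Layer placement and sizes are assumptions, not the paper's exact design.
```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
current = torch.randn(1, 40, 256)   # encoder states of the current utterance
previous = torch.randn(1, 60, 256)  # cached states of the preceding utterance
fused, _ = attn(query=current, key=previous, value=previous)
print((current + fused).shape)      # residual fusion: torch.Size([1, 40, 256])
```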
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own or can be applied as low-cost second-stage pre-training; a toy unit-induction sketch follows this entry.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
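The pseudo-language induction can be approximated in a few lines: cluster frame-level features, then collapse consecutive repeats into discrete units. Feature choice, cluster count, and the omitted subword step are toy assumptions.
```python
# Induce pseudo units from speech features: cluster frames, collapse repeats.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 39))      # stand-in for e.g. MFCC frames
units = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(features)

def dedup(ids):
    """Collapse consecutive repeats: 3 3 3 5 5 1 -> 3 5 1."""
    return [int(x) for i, x in enumerate(ids) if i == 0 or x != ids[i - 1]]

pseudo_transcript = dedup(units[:50])      # pseudo tokens for one "utterance"
print(pseudo_transcript)
```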
- TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder in decoding features into text, which leads to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
arXiv Detail & Related papers (2021-11-16T09:10:39Z)
- Dependency Parsing based Semantic Representation Learning with Graph Neural Network for Enhancing Expressiveness of Text-to-Speech [49.05471750563229]
We propose a semantic representation learning method based on a graph neural network that considers the dependency relations of a sentence.
We show that the proposed method outperforms a baseline using vanilla BERT features on both the LJSpeech and Blizzard Challenge 2013 datasets; a toy message-passing sketch follows this entry.
arXiv Detail & Related papers (2021-04-14T13:09:51Z)
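A toy version of the graph step: build an adjacency matrix from dependency heads and run one round of message passing over token features. The hand-set heads, sizes, and single linear layer are illustrative; the paper uses BERT features and a fuller GNN.
```python
import torch
import torch.nn as nn

tokens = ["the", "cat", "sat"]
heads = [1, 2, 2]                      # dependency head per token (root = self)
n, d = len(tokens), 16

adj = torch.eye(n)
for child, head in enumerate(heads):   # symmetric child <-> head edges
    adj[child, head] = adj[head, child] = 1.0
adj = adj / adj.sum(dim=1, keepdim=True)   # row-normalize

x = torch.randn(n, d)                  # stand-in for BERT token features
layer = nn.Linear(d, d)
h = torch.relu(layer(adj @ x))         # one graph-convolution step
print(h.shape)                         # torch.Size([3, 16])
```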
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER on the LibriSpeech "clean" and "other" test data, respectively; a sketch of a time-restricted attention mask follows this entry.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
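Time-restricted self-attention can be expressed as a boolean mask limiting each frame to a fixed left context and a small look-ahead; such a mask plugs into standard attention layers (e.g. as attn_mask in PyTorch). Window sizes below are illustrative, and triggered attention is not shown.
```python
import torch

def time_restricted_mask(T, left=8, right=2):
    """True marks key positions a query frame may NOT attend to."""
    idx = torch.arange(T)
    rel = idx[None, :] - idx[:, None]      # key index minus query index
    return (rel < -left) | (rel > right)   # (T, T) boolean mask

print(time_restricted_mask(6, left=2, right=1).int())
```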