Hierarchical Transformer-based Large-Context End-to-end ASR with
Large-Context Knowledge Distillation
- URL: http://arxiv.org/abs/2102.07935v1
- Date: Tue, 16 Feb 2021 03:15:15 GMT
- Title: Hierarchical Transformer-based Large-Context End-to-end ASR with
Large-Context Knowledge Distillation
- Authors: Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro
Tanaka, Shota Orihashi
- Abstract summary: We present a novel large-context end-to-end automatic speech recognition (E2E-ASR) model and its effective training method based on knowledge distillation.
This paper proposes a hierarchical transformer-based large-context E2E-ASR model that combines the transformer architecture with hierarchical encoder-decoder based large-context modeling.
- Score: 28.51624095262708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel large-context end-to-end automatic speech recognition
(E2E-ASR) model and its effective training method based on knowledge
distillation. Common E2E-ASR models have mainly focused on utterance-level
processing, in which each utterance is transcribed independently. In
contrast, large-context E2E-ASR models, which take long-range sequential
contexts beyond utterance boundaries into account, are well suited to
handling sequences of utterances such as discourses and conversations.
However, the transformer architecture, which has recently achieved
state-of-the-art performance among utterance-level ASR systems, has not yet
been introduced into large-context ASR systems. We expect that the
transformer architecture can be leveraged to effectively capture not only
input speech contexts but also long-range sequential contexts beyond
utterance boundaries. Therefore, this
paper proposes a hierarchical transformer-based large-context E2E-ASR model
that combines the transformer architecture with hierarchical encoder-decoder
based large-context modeling. In addition, to enable the proposed model to
exploit long-range sequential contexts, we propose a large-context knowledge
distillation method that distills knowledge from a pre-trained large-context
language model during training. We evaluate the
effectiveness of the proposed model and proposed training method on Japanese
discourse ASR tasks.
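The distillation idea in the abstract can be illustrated with a small sketch: the student ASR model's per-token output distribution is pulled toward that of a pre-trained large-context teacher, while a standard cross-entropy term keeps it anchored to the reference transcripts. This is a minimal NumPy illustration of a generic KD objective, not the paper's exact loss; the `alpha` interpolation weight and `temperature` scaling are common KD conventions assumed here, not values taken from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the vocabulary axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def large_context_kd_loss(student_logits, teacher_logits, targets,
                          alpha=0.5, temperature=2.0):
    """Interpolate cross-entropy on the reference labels with a KL term
    that pulls the student toward the teacher's (large-context) token
    distribution. Shapes: (seq_len, vocab) for the logits, (seq_len,)
    for the integer targets."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student), averaged over time steps
    kl = np.mean(np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                                     - np.log(p_student + 1e-12)), axis=-1))
    # standard cross-entropy against the reference transcript labels
    log_p = np.log(softmax(student_logits) + 1e-12)
    ce = -np.mean(log_p[np.arange(len(targets)), targets])
    return alpha * kl + (1.0 - alpha) * ce
```

When the student matches the teacher exactly, the KL term vanishes and only the cross-entropy term remains, so the objective reduces to ordinary utterance-level training.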
Related papers
- Advancing Multi-talker ASR Performance with Large Language Models [48.52252970956368]
Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR).
In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging pre-trained speech encoder and LLM.
Our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI.
arXiv Detail & Related papers (2024-08-30T17:29:25Z)
- Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems.
To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder.
Results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
- Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition [12.77573161345651]
This paper proposes integrating a pre-trained speech representation model and a large language model (LLM) for E2E ASR.
The proposed model enables the optimization of the entire ASR process, including acoustic feature extraction and acoustic and language modeling.
arXiv Detail & Related papers (2023-12-06T18:34:42Z)
- End-to-End Speech Recognition: A Survey [68.35707678386949]
The goal of this survey is to provide a taxonomy of E2E ASR models and corresponding improvements.
All relevant aspects of E2E ASR are covered in this work, accompanied by discussions of performance and deployment opportunities.
arXiv Detail & Related papers (2023-03-03T01:46:41Z)
- Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism [20.782319059183173]
We propose to explicitly model the inter-sentential information in a Transformer based end-to-end architecture for conversational speech recognition.
We show the effectiveness of our proposed method on several open-source dialogue corpora; it consistently improves performance over utterance-level Transformer-based ASR models.
arXiv Detail & Related papers (2022-07-02T17:17:47Z)
- SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks.
Specifically, our analysis shows that SRU++ can surpass Conformer by a large margin on long-form speech input.
arXiv Detail & Related papers (2021-10-11T19:23:50Z)
- Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification [66.62686601948455]
We exploit the scope of the transformer distillation method that is specifically designed for knowledge distillation from a transformer based language model to a transformer based speech model.
We achieve intent classification accuracies of 99.10% and 88.79% on the Fluent speech corpus and the ATIS database, respectively.
arXiv Detail & Related papers (2021-08-05T13:08:13Z)
- Non-autoregressive Transformer-based End-to-end ASR using BERT [13.07939371864781]
This paper presents a transformer-based end-to-end automatic speech recognition (ASR) model based on BERT.
A series of experiments conducted on the AISHELL-1 dataset demonstrates competitive or superior results.
arXiv Detail & Related papers (2021-04-10T16:22:17Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches that perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained on small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER on the LibriSpeech "clean" and "other" test sets, respectively.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.