BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder
- URL: http://arxiv.org/abs/2211.00792v1
- Date: Wed, 2 Nov 2022 00:10:43 GMT
- Title: BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder
- Authors: Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe
- Abstract summary: We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech recognition (E2E-ASR) model.
BECTRA is a transducer-based model that adopts BERT-CTC as its encoder and trains an ASR-specific decoder on a vocabulary suited to the target task.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech
recognition (E2E-ASR) model formulated by the transducer with a BERT-enhanced
encoder. Integrating a large-scale pre-trained language model (LM) into E2E-ASR
has been actively studied, aiming to utilize versatile linguistic knowledge for
generating accurate text. One crucial factor that makes this integration
challenging is the vocabulary mismatch: the vocabulary constructed for a
pre-trained LM is generally too large for E2E-ASR training and is likely to
mismatch the target ASR domain. To overcome this issue, we propose BECTRA,
an extended version of our previous BERT-CTC, which realizes BERT-based
E2E-ASR using a vocabulary of interest. BECTRA is a transducer-based model
that adopts BERT-CTC as its encoder and trains an ASR-specific decoder on a
vocabulary suited to the target task. Combining the transducer with
BERT-CTC, we also propose a novel inference algorithm that takes advantage
of both autoregressive and non-autoregressive decoding. Experimental
results on several ASR tasks, varying in amounts of data, speaking styles, and
languages, demonstrate that BECTRA outperforms BERT-CTC by effectively dealing
with the vocabulary mismatch while exploiting BERT knowledge.
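The vocabulary-mismatch handling described in the abstract can be sketched as a toy two-pass decode: a non-autoregressive CTC-style first pass in the large pre-trained-LM subword vocabulary, whose hypothesis is re-tokenized into a smaller task vocabulary for the ASR-specific decoder. The function names, the greedy CTC logic, and the character-level task vocabulary below are all illustrative assumptions, not the authors' implementation.

```python
def ctc_collapse(frame_labels, blank="_"):
    """Greedy CTC decoding: merge repeated labels, then drop blanks."""
    collapsed = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            collapsed.append(label)
        prev = label
    return collapsed

def retokenize(subwords, task_vocab):
    """Map LM-vocabulary subwords (BERT-style '##' continuations) onto a
    smaller task vocabulary -- here, characters -- so the decoder can
    operate in the vocabulary of interest."""
    text = "".join(sw[2:] if sw.startswith("##") else sw for sw in subwords)
    return [ch for ch in text if ch in task_vocab]

# Frame-level first-pass labels in the (large) BERT subword vocabulary.
frames = ["_", "hel", "hel", "_", "##lo", "##lo", "_"]
subword_hyp = ctc_collapse(frames)   # ['hel', '##lo']
char_vocab = set("abcdefghijklmnopqrstuvwxyz")
char_hyp = retokenize(subword_hyp, char_vocab)
print("".join(char_hyp))             # hello
```

In the actual model the second pass is a transducer decoder scoring over the task vocabulary, not a deterministic re-tokenization; the sketch only illustrates how the two vocabularies are bridged.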
Related papers
- Agent-driven Generative Semantic Communication with Cross-Modality and Prediction
We propose a novel agent-driven generative semantic communication framework based on reinforcement learning.
In this work, we develop an agent-assisted semantic encoder with cross-modality capability that tracks semantic changes and channel conditions to perform adaptive semantic extraction and sampling.
The effectiveness of the designed models has been verified using the UA-DETRAC dataset, demonstrating the performance gains of the overall A-GSC framework.
arXiv Detail & Related papers (2024-04-10T13:24:27Z)
- Enhancing EEG-to-Text Decoding through Transferable Representations from Pre-trained Contrastive EEG-Text Masked Autoencoder
We propose Contrastive EEG-Text Masked Autoencoder (CET-MAE), a novel model that orchestrates compound self-supervised learning across and within EEG and text.
We also develop a framework called E2T-PTR (EEG-to-Text decoding using Pretrained Transferable Representations) to decode text from EEG sequences.
arXiv Detail & Related papers (2024-02-27T11:45:21Z)
- Utilizing BERT for Information Retrieval: Survey, Applications, Resources, and Challenges
This survey focuses on approaches that apply pretrained transformer encoders like BERT to information retrieval (IR).
We group them into six high-level categories: (i) handling long documents, (ii) integrating semantic information, (iii) balancing effectiveness and efficiency, (iv) predicting the weights of terms, (v) query expansion, and (vi) document expansion.
We find that, for specific tasks, fine-tuned BERT encoders still outperform, and at a lower deployment cost.
arXiv Detail & Related papers (2024-02-18T23:22:40Z)
- Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding
We propose an Acoustic and Semantic Cooperative Decoder (ASCD) for ASR.
Unlike vanilla decoders that process acoustic and semantic features in two separate stages, ASCD integrates them cooperatively.
We show that ASCD significantly improves performance by leveraging acoustic and semantic information cooperatively.
arXiv Detail & Related papers (2023-05-23T13:25:44Z)
- Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks
We propose a solution that combines the Transducer and an Attention-based Encoder-Decoder (TAED) for speech-to-text tasks.
The new method leverages the attention-based decoder's strength in non-monotonic sequence-to-sequence learning while retaining the Transducer's streaming property.
We evaluate the proposed approach on the MuST-C dataset, and the findings demonstrate that TAED performs significantly better than the Transducer on offline automatic speech recognition (ASR) and speech-to-text translation (ST) tasks.
arXiv Detail & Related papers (2023-05-04T18:34:50Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- BERT-LID: Leveraging BERT to Improve Spoken Language Identification
Language identification is the task of automatically determining the language conveyed by a spoken segment.
Although language identification attains high accuracy on medium-length or long utterances, performance on short utterances remains far from satisfactory.
We propose an effective BERT-based language identification system (BERT-LID) to improve language identification performance.
arXiv Detail & Related papers (2022-03-01T10:01:25Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
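The multi-hypothesis idea can be illustrated with a much simpler scheme than the paper's attention-based fusion: a per-position majority vote over several hypothetical ASR outputs, just to show how combining hypotheses attenuates individual recognition errors. The vote is a stand-in, not the proposed model.

```python
from collections import Counter

def majority_vote(hypotheses):
    """Combine equal-length word sequences by per-position voting."""
    fused = []
    for words in zip(*hypotheses):
        fused.append(Counter(words).most_common(1)[0][0])
    return fused

# Three noisy ASR hypotheses for the same utterance (illustrative).
hyps = [
    "the cat sat on the mat".split(),
    "the cap sat on the mat".split(),
    "the cat sad on the mat".split(),
]
print(" ".join(majority_vote(hyps)))  # the cat sat on the mat
```

Each hypothesis contains one error, but no error is shared across hypotheses, so the fused output recovers the correct transcript; the paper achieves a similar effect by letting the summarizer attend over all hypotheses.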
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- Context-Aware Transformer Transducer for Speech Recognition
We present a novel context-aware transformer transducer (CATT) network that improves the state-of-the-art transformer-based ASR system by taking advantage of such contextual signals.
We show that CATT, using a BERT-based context encoder, improves the word error rate of the baseline transformer transducer and outperforms an existing deep contextual model by 24.2% and 19.4%, respectively.
arXiv Detail & Related papers (2021-11-05T04:14:35Z)
- Training ELECTRA Augmented with Multi-word Selection
We present a new text encoder pre-training method that improves ELECTRA based on multi-task learning.
Specifically, we train the discriminator to simultaneously detect replaced tokens and select original tokens from candidate sets.
arXiv Detail & Related papers (2021-05-31T23:19:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.