The THUEE System Description for the IARPA OpenASR21 Challenge
- URL: http://arxiv.org/abs/2206.14660v1
- Date: Wed, 29 Jun 2022 14:03:05 GMT
- Title: The THUEE System Description for the IARPA OpenASR21 Challenge
- Authors: Jing Zhao, Haoyu Wang, Jinpeng Li, Shuzhou Chai, Guan-Bo Wang, Guoguo Chen, Wei-Qiang Zhang
- Abstract summary: This paper describes the THUEE team's speech recognition system for the IARPA Open Automatic Speech Recognition Challenge (OpenASR21).
We achieve outstanding results under both the Constrained and Constrained-plus training conditions.
We find that the feature extractor plays an important role when applying the wav2vec2.0 pre-trained model to the encoder-decoder based CTC/Attention ASR architecture.
- Score: 12.458730613670316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes the THUEE team's speech recognition system for the IARPA
Open Automatic Speech Recognition Challenge (OpenASR21), with further
experimental explorations. We achieve outstanding results under both the
Constrained and Constrained-plus training conditions. For the Constrained
training condition, we construct our basic ASR system based on the standard
hybrid architecture. To alleviate the Out-Of-Vocabulary (OOV) problem, we
extend the pronunciation lexicon using Grapheme-to-Phoneme (G2P) techniques for
both OOV and potential new words. Standard acoustic model structures such as
CNN-TDNN-F and CNN-TDNN-F-A are adopted. In addition, multiple data
augmentation techniques are applied. For the Constrained-plus training
condition, we use the self-supervised learning framework wav2vec2.0. We
experiment with various fine-tuning techniques with the Connectionist Temporal
Classification (CTC) criterion on top of the publicly available pre-trained
model XLSR-53. We find that the frontend feature extractor plays an important
role when applying the wav2vec2.0 pre-trained model to the encoder-decoder
based CTC/Attention ASR architecture. Extra improvements can be achieved by
using the CTC model fine-tuned in the target language as the frontend feature
extractor.
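To make the lexicon extension step concrete, the following Python sketch appends G2P-generated pronunciations for OOV words to an existing lexicon. It is only an illustration under assumptions: the `g2p` callable, the file names, and the tab-separated Kaldi-style lexicon format are not specified in the paper.
```python
# A minimal sketch, under assumptions, of extending a Kaldi-style pronunciation
# lexicon with G2P output for OOV words. The g2p callable is a placeholder for
# whatever grapheme-to-phoneme model is used (e.g. one trained on the seed
# lexicon); the file names and tab-separated format are illustrative only.
from typing import Callable, Iterable, List

def collect_oov_words(transcripts: Iterable[str], lexicon_words: set) -> List[str]:
    """Return words that occur in the transcripts but are missing from the lexicon."""
    oov = set()
    for line in transcripts:
        for word in line.strip().split():
            if word not in lexicon_words:
                oov.add(word)
    return sorted(oov)

def extend_lexicon(lexicon_path: str, transcript_path: str, out_path: str,
                   g2p: Callable[[str], List[str]]) -> None:
    """Append G2P-generated pronunciations for OOV words to the lexicon."""
    with open(lexicon_path, encoding="utf-8") as f:
        entries = [line.rstrip("\n") for line in f if line.strip()]
    lexicon_words = {line.split()[0] for line in entries}

    with open(transcript_path, encoding="utf-8") as f:
        oov_words = collect_oov_words(f, lexicon_words)

    for word in oov_words:
        phones = g2p(word)                     # hypothetical G2P model call
        if phones:
            entries.append(f"{word}\t{' '.join(phones)}")

    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(entries) + "\n")

# Example usage with a dummy G2P that spells the word letter by letter:
# extend_lexicon("lexicon.txt", "text", "lexicon_ext.txt", g2p=lambda w: list(w))
```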
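For the Constrained-plus condition, the core ingredient is CTC fine-tuning on top of the XLSR-53 checkpoint. The sketch below shows one way to set this up with the Hugging Face `transformers` API; the toy character vocabulary, dummy audio, and single training step are illustrative assumptions and do not reflect the team's actual data, hyperparameters, or the encoder-decoder CTC/Attention integration described above.
```python
# Minimal sketch (assumptions only) of CTC fine-tuning on the public XLSR-53
# checkpoint with Hugging Face transformers. Replace the toy vocabulary and
# dummy audio/text with real target-language data.
import json
import numpy as np
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2ForCTC, Wav2Vec2Processor)

# Toy character vocabulary for the target language (placeholder, not the real one).
chars = list("abcdefghijklmnopqrstuvwxyz'") + ["|", "[UNK]", "[PAD]"]
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump({c: i for i, c in enumerate(chars)}, f)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0, do_normalize=True,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)

# Multilingual pre-trained encoder with a freshly initialized CTC output layer.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer),
)
model.freeze_feature_encoder()  # keep the convolutional feature encoder frozen

# One illustrative training step on dummy data.
audio = np.random.randn(16000).astype(np.float32)        # 1 s of 16 kHz audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = tokenizer("hello world", return_tensors="pt").input_ids
loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()
```
In the paper's pipeline, a CTC model fine-tuned in the target language in this manner is then reused as the frontend feature extractor of the CTC/Attention system, which the authors report brings extra improvements.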
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units [8.86336076082867]
We propose a method for pretraining E2E KWS systems with untranscribed data.
We show that finetuning such a model significantly outperforms a model trained from scratch.
arXiv Detail & Related papers (2024-07-05T17:07:58Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Improving CTC-based speech recognition via knowledge transferring from pre-trained language models [30.599901925058873]
We propose two knowledge transferring methods to improve CTC-based models.
The first method is based on representation learning, in which the CTC-based models use the representation produced by BERT as an auxiliary learning target.
The second method is based on joint classification learning, which combines GPT2 for text modeling with a hybrid CTC/attention architecture.
arXiv Detail & Related papers (2022-02-22T11:30:55Z)
- Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model [4.490054848527943]
We propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models.
To the best of our knowledge, this is the first work to utilize both a pretrained AM and LM in an S2S ASR system.
arXiv Detail & Related papers (2021-12-14T09:38:31Z)
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors.
For the video modality, we developed our best solution with the RetinaFace face detector and a deep ResNet face embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.