Towards Effective and Compact Contextual Representation for Conformer
Transducer Speech Recognition Systems
- URL: http://arxiv.org/abs/2306.13307v2
- Date: Mon, 26 Jun 2023 02:48:53 GMT
- Authors: Mingyu Cui, Jiawen Kang, Jiajun Deng, Xi Yin, Yutao Xie, Xie Chen,
Xunying Liu
- Abstract summary: This paper aims to derive a suitable compact representation of the most relevant history contexts.
Experiments on the 1000-hr Gigaspeech corpus demonstrate that the proposed contextualized streaming Conformer-Transducers outperform the baseline.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current ASR systems are mainly trained and evaluated at the utterance level.
Incorporating long-range cross-utterance context can further improve performance. A key
task is to derive a suitable compact representation of the most relevant history
contexts. In contrast to previous research based on either LSTM-RNN encoded histories,
which attenuate the information from longer-range contexts, or frame-level concatenation
of Transformer context embeddings, in this paper compact low-dimensional cross-utterance
contextual features are learned in the Conformer-Transducer encoder using specially
designed attention pooling layers applied over efficiently cached history vectors of
preceding utterances.
Experiments on the 1000-hr Gigaspeech corpus demonstrate that the proposed
contextualized streaming Conformer-Transducers outperform the baseline using
utterance-internal context only, with statistically significant WER reductions
of 0.7% and 0.5% absolute (4.3% and 3.1% relative) on the dev and test data,
respectively.
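The core mechanism described above is attention pooling over cached history vectors from preceding utterances, producing one compact context vector. A minimal NumPy sketch of the idea; the single learned query, the cache size, and the dimensionality here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(history, query):
    """Pool a variable-length cache of history vectors (T, d) into a
    single compact context vector via single-query dot-product attention."""
    scores = history @ query / np.sqrt(history.shape[1])  # (T,) scaled scores
    weights = softmax(scores)          # attention distribution over the cache
    return weights @ history           # (d,) pooled context vector

rng = np.random.default_rng(0)
d = 8
cache = rng.standard_normal((20, d))   # cached vectors from preceding utterances
query = rng.standard_normal(d)         # learned pooling query (hypothetical)
context = attention_pool(cache, query)
print(context.shape)  # (8,)
```

Because the cache stores already-encoded vectors, preceding utterances need not be re-encoded at each step; the pooled vector stays fixed-size regardless of how much history is cached, which is what makes the representation compact.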
Related papers
- Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR [74.38242498079627]
Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable.
In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems.
arXiv Detail & Related papers (2024-09-13T13:01:09Z) - Learning Repeatable Speech Embeddings Using An Intra-class Correlation
Regularizer [16.716653844774374]
We evaluate the repeatability of embeddings using the intra-class correlation coefficient (ICC).
We propose a novel regularizer, the ICC regularizer, as a complementary component for contrastive losses to guide deep neural networks to produce embeddings with higher repeatability.
We implement the ICC regularizer and apply it to three speech tasks: speaker verification, voice style conversion, and a clinical application for detecting dysphonic voice.
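The ICC in this entry measures how much of the embedding variance lies between classes rather than within them; the regularizer presumably maximizes such a term alongside the contrastive loss. A hedged sketch of ICC(1) from a one-way random-effects model; the balanced-design assumption and the toy data are mine, not the paper's:

```python
import numpy as np

def icc_1(x, labels):
    """ICC(1) per embedding dimension, averaged over dimensions.
    x: (N, d) embeddings; labels: (N,) class ids, balanced k per class.
    Higher ICC means more repeatable within-class embeddings."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n = len(classes)
    k = len(labels) // n                          # measurements per class
    grand = x.mean(axis=0)
    class_means = np.stack([x[labels == c].mean(axis=0) for c in classes])
    msb = k * ((class_means - grand) ** 2).sum(axis=0) / (n - 1)   # between
    msw = sum(((x[labels == c] - class_means[i]) ** 2).sum(axis=0)
              for i, c in enumerate(classes)) / (n * (k - 1))      # within
    icc = (msb - msw) / (msb + (k - 1) * msw)
    return icc.mean()

rng = np.random.default_rng(1)
centers = rng.standard_normal((4, 3)) * 5      # well-separated class centers
labels = np.repeat(np.arange(4), 10)
x = centers[labels] + 0.1 * rng.standard_normal((40, 3))
print(icc_1(x, labels))  # close to 1 for tight clusters
```

Tight clusters around distinct centers push ICC toward 1, while label-independent noise pushes it toward 0, which is why the statistic serves as a repeatability score for embeddings.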
arXiv Detail & Related papers (2023-10-25T23:21:46Z) - CLIP-based Synergistic Knowledge Transfer for Text-based Person
Retrieval [66.93563107820687]
We introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for Text-based Person Retrieval (TPR).
To explore CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module constructed by text-to-image and image-to-text bidirectional prompts and coupling projections.
CSKT outperforms the state-of-the-art approaches across three benchmark datasets when the training parameters merely account for 7.4% of the entire model.
arXiv Detail & Related papers (2023-09-18T05:38:49Z) - Token-Level Serialized Output Training for Joint Streaming ASR and ST
Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z) - Leveraging Cross-Utterance Context For ASR Decoding [6.033324057680156]
Cross-utterance information has been shown to be beneficial during second-pass re-scoring.
We investigate the incorporation of long-context transformer LMs for cross-utterance decoding of acoustic models via beam search.
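The second-pass idea in this entry can be sketched as interpolating first-pass acoustic scores with an LM score whose context spans the previous utterance. The function names, the interpolation weight, and the toy word-overlap "LM" below are illustrative stand-ins for a real long-context Transformer LM:

```python
def rescore_nbest(nbest, lm_score, prev_text, lam=0.5):
    """Pick the best hypothesis by interpolating first-pass acoustic
    scores with a cross-utterance LM score (second-pass rescoring).
    nbest: list of (hypothesis, acoustic_log_score) pairs."""
    return max(nbest, key=lambda h: h[1] + lam * lm_score(prev_text, h[0]))

def toy_lm(context, hyp):
    # Toy stand-in LM: rewards word overlap with the previous utterance.
    ctx = set(context.lower().split())
    return sum(w in ctx for w in hyp.lower().split())

prev = "the weather forecast predicts rain"
nbest = [("rein tomorrow", -1.0), ("rain tomorrow", -1.2)]
best = rescore_nbest(nbest, toy_lm, prev)
print(best[0])  # "rain tomorrow": context overlap outweighs the acoustic gap
```

The point of the toy example: the acoustically preferred hypothesis loses to the one that is coherent with the previous utterance, which is exactly the benefit cross-utterance LM context brings to re-scoring.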
arXiv Detail & Related papers (2023-06-29T12:48:25Z) - Leveraging Acoustic Contextual Representation by Audio-textual
Cross-modal Learning for Conversational ASR [25.75615870266786]
We propose an audio-textual cross-modal representation extractor to learn contextual representations directly from preceding speech.
The effectiveness of the proposed approach is validated on several Mandarin conversation corpora.
arXiv Detail & Related papers (2022-07-03T13:32:24Z) - Advanced Long-context End-to-end Speech Recognition Using
Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z) - Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner
Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on RNN-Transducer together with improved beam search, achieves a WER only 3.8% absolute worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.