ASR-Aware End-to-end Neural Diarization
- URL: http://arxiv.org/abs/2202.01286v1
- Date: Wed, 2 Feb 2022 21:17:14 GMT
- Title: ASR-Aware End-to-end Neural Diarization
- Authors: Aparna Khare, Eunjung Han, Yuguang Yang, Andreas Stolcke
- Abstract summary: We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
- Score: 15.172086811068962
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a Conformer-based end-to-end neural diarization (EEND) model that
uses both acoustic input and features derived from an automatic speech
recognition (ASR) model. Two categories of features are explored: features
derived directly from ASR output (phones, position-in-word and word boundaries)
and features derived from a lexical speaker change detection model, trained by
fine-tuning a pretrained BERT model on the ASR output. Three modifications to
the Conformer-based EEND architecture are proposed to incorporate the features.
First, ASR features are concatenated with acoustic features. Second, we propose
a new attention mechanism called contextualized self-attention that utilizes
ASR features to build robust speaker representations. Finally, multi-task
learning is used to train the model to minimize classification loss for the ASR
features along with diarization loss. Experiments on the two-speaker English
conversations of Switchboard+SRE data sets show that multi-task learning with
position-in-word information is the most effective way of utilizing ASR
features, reducing the diarization error rate (DER) by 20% relative to the
baseline.
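As a rough illustration of the multi-task variant that works best above, here is a minimal PyTorch-style sketch; the encoder stand-in, dimensions, the four-way position-in-word label set, and the 0.5 loss weight are all assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class MultiTaskEEND(nn.Module):
    """Hypothetical EEND-style model with an auxiliary ASR-feature head.

    A shared encoder (a stand-in for the paper's Conformer blocks)
    feeds (1) a per-frame speaker-activity head trained with the
    diarization loss and (2) a per-frame classifier trained to predict
    a position-in-word label derived from ASR output.
    """

    def __init__(self, feat_dim=80, hidden=256, n_speakers=2, n_piw=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                       batch_first=True),
            num_layers=2)
        self.diar_head = nn.Linear(hidden, n_speakers)  # speaker activities
        self.piw_head = nn.Linear(hidden, n_piw)        # position-in-word

    def forward(self, x):                    # x: (batch, frames, features)
        h = self.encoder(self.proj(x))
        return self.diar_head(h), self.piw_head(h)

model = MultiTaskEEND()
x = torch.randn(8, 500, 80)
diar_logits, piw_logits = model(x)
diar_target = torch.randint(0, 2, (8, 500, 2)).float()
piw_target = torch.randint(0, 4, (8, 500))

diar_loss = nn.BCEWithLogitsLoss()(diar_logits, diar_target)
piw_loss = nn.CrossEntropyLoss()(piw_logits.transpose(1, 2), piw_target)
loss = diar_loss + 0.5 * piw_loss            # weight is an assumption
loss.backward()
```

Note that real EEND training computes the diarization loss permutation-invariantly over speaker label orderings; the plain binary cross-entropy above elides that detail.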
Related papers
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that strengthens the representation of each modality by fusing them at different levels of the audio and visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
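A minimal sketch of what multi-layer cross-attention fusion could look like, assuming the audio stream queries the visual stream at several encoder depths; module names, dimensions, and the choice of fusion points are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """One fusion point: audio frames attend to visual frames, and the
    attended visual information is added back residually."""

    def __init__(self, dim=256, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)

audio = torch.randn(4, 100, 256)    # (batch, audio frames, dim)
visual = torch.randn(4, 25, 256)    # (batch, video frames, dim)
fuse_early, fuse_deep = CrossAttentionFusion(), CrossAttentionFusion()
audio = fuse_early(audio, visual)   # fusion after an early encoder layer
audio = fuse_deep(audio, visual)    # fusion again at a deeper layer
```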
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- ROSE: A Recognition-Oriented Speech Enhancement Framework in Air Traffic Control Using Multi-Objective Learning [6.60571587618006]
Radio speech echo is a specific phenomenon in the air traffic control (ATC) domain, which degrades speech quality and impacts automatic speech recognition (ASR) accuracy.
In this work, a time-domain recognition-oriented speech enhancement framework is proposed to improve speech intelligibility and advance ASR accuracy.
The framework serves as a plug-and-play tool in ATC scenarios and does not require additional retraining of the ASR model.
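One plausible reading of the multi-objective setup, sketched below: a time-domain enhancer trained with a waveform loss plus a feature-domain loss meant to preserve what an ASR front end sees, so the frozen ASR model needs no retraining. The architecture, the log-spectral proxy loss, and the 0.3 weight are assumptions.

```python
import torch
import torch.nn as nn

# Toy time-domain enhancer; the real model would be far larger.
enhancer = nn.Sequential(nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(),
                         nn.Conv1d(16, 1, 9, padding=4))

noisy = torch.randn(2, 1, 16000)    # 1 s of 16 kHz audio (assumed)
clean = torch.randn(2, 1, 16000)
enhanced = enhancer(noisy)

# Objective 1: signal-level reconstruction.
signal_loss = nn.L1Loss()(enhanced, clean)

# Objective 2: recognition-oriented proxy, here a log-magnitude
# spectral match (a stand-in for whatever ASR-facing loss is used).
def spec(w):
    return torch.stft(w.squeeze(1), 512, window=torch.hann_window(512),
                      return_complex=True).abs()

feature_loss = nn.L1Loss()(torch.log1p(spec(enhanced)),
                           torch.log1p(spec(clean)))
loss = signal_loss + 0.3 * feature_loss     # weight is an assumption
loss.backward()
```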
arXiv Detail & Related papers (2023-12-11T04:51:41Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
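For orientation, a toy cascade in the shape the snippet describes: separation front end, then a self-supervised encoder (a stand-in for WavLM), then an ASR head consuming those representations. Every module and dimension here is a placeholder, not the paper's actual TF-GridNet or WavLM configuration.

```python
import torch
import torch.nn as nn

class SeparationASRPipeline(nn.Module):
    def __init__(self, n_speakers=2, feat_dim=768, vocab=32):
        super().__init__()
        self.separator = nn.Conv1d(1, n_speakers, 9, padding=4)     # toy
        self.ssl_encoder = nn.Conv1d(1, feat_dim, 400, stride=320)  # ~20 ms hop
        self.asr_head = nn.Linear(feat_dim, vocab)

    def forward(self, mixture):                 # (batch, 1, samples)
        sources = self.separator(mixture)       # (batch, speakers, samples)
        logits = []
        for s in range(sources.size(1)):
            feats = self.ssl_encoder(sources[:, s:s + 1])  # (B, D, frames)
            logits.append(self.asr_head(feats.transpose(1, 2)))
        return logits                           # per-speaker token logits

out = SeparationASRPipeline()(torch.randn(2, 1, 16000))
print([o.shape for o in out])
```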
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- Conversational Speech Recognition By Learning Conversation-level Characteristics [25.75615870266786]
This paper proposes a conversational ASR model which explicitly learns conversation-level characteristics under the prevalent end-to-end neural framework.
Experiments on two Mandarin conversational ASR tasks show that the proposed model achieves a maximum 12% relative character error rate (CER) reduction.
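The paper's exact mechanism isn't spelled out here; one assumed realization of conversation-level conditioning is to summarize previous turns into a context vector and add it to the current utterance's encoder output, as below.

```python
import torch
import torch.nn as nn

class ConversationContext(nn.Module):
    """Hypothetical conditioning: a GRU summarizes encoder states from
    earlier turns, and the summary biases the current utterance."""

    def __init__(self, dim=256):
        super().__init__()
        self.summarize = nn.GRU(dim, dim, batch_first=True)

    def forward(self, current_enc, history_enc):
        # history_enc: (batch, frames from earlier turns, dim)
        _, ctx = self.summarize(history_enc)       # (1, batch, dim)
        return current_enc + ctx.transpose(0, 1)   # broadcast over frames

ctx_layer = ConversationContext()
fused = ctx_layer(torch.randn(2, 120, 256), torch.randn(2, 300, 256))
```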
arXiv Detail & Related papers (2022-02-16T04:33:05Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
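A minimal sketch of attention-based hypothesis fusion as described: encode each n-best hypothesis, pool the tokens of all hypotheses into one memory, and let the summarizer's queries attend across them so no single erroneous hypothesis dominates. Names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class HypothesisFusion(nn.Module):
    def __init__(self, dim=256, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)

    def forward(self, query, hypotheses):
        # hypotheses: list of (batch, tokens_i, dim), one per n-best entry
        memory = torch.cat(hypotheses, dim=1)   # pool all hypothesis tokens
        fused, weights = self.attn(query, memory, memory)
        return fused, weights

fusion = HypothesisFusion()
nbest = [torch.randn(2, 40, 256) for _ in range(5)]   # encoded 5-best list
summary_query = torch.randn(2, 30, 256)               # summarizer states
fused, attn_weights = fusion(summary_query, nbest)
```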
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
arXiv Detail & Related papers (2021-04-20T17:10:01Z)
- Feature Replacement and Combination for Hybrid ASR Systems [47.74348197215634]
We investigate the usefulness of one of these front-end frameworks, namely wav2vec, for hybrid ASR systems.
In addition to deploying a pre-trained feature extractor, we explore how to make use of an existing acoustic model (AM) trained on the same task with different features.
We obtain a relative improvement of 4% and 6% over our previous best model on the LibriSpeech test-clean and test-other sets.
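The combination idea reduces to frame-wise concatenation of the pretrained front-end features with the original acoustic-model features before the hybrid AM; the sketch below assumes dimensions and a senone count, none of which come from the paper.

```python
import torch
import torch.nn as nn

wav2vec_feats = torch.randn(4, 200, 512)    # pretrained front-end output
gammatone_feats = torch.randn(4, 200, 40)   # original AM features
combined = torch.cat([wav2vec_feats, gammatone_feats], dim=-1)  # (4, 200, 552)

acoustic_model = nn.Sequential(nn.Linear(552, 1024), nn.ReLU(),
                               nn.Linear(1024, 4000))  # 4000 assumed senones
senone_logits = acoustic_model(combined)
```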
arXiv Detail & Related papers (2021-04-09T11:04:58Z)
- Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization [73.62550438861942]
This paper proposes a new paradigm for handling far-field multi-speaker data in an end-to-end neural manner, called directional automatic speech recognition (D-ASR).
In D-ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance.
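A sketch of the latent-angle idea under stated assumptions: a localizer regresses one azimuth per source from the multi-channel input, and those angles would parameterize a differentiable separation/beamforming stage trained end to end with the ASR loss. The network and the omitted beamformer are placeholders.

```python
import torch
import torch.nn as nn

class DirectionalFrontEnd(nn.Module):
    def __init__(self, n_channels=4, n_sources=2, dim=128):
        super().__init__()
        self.localizer = nn.Sequential(
            nn.Conv1d(n_channels, dim, 9, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(dim, n_sources))          # one angle per source

    def forward(self, multichannel):            # (batch, mics, samples)
        # Squash to (-pi, pi); the angles would steer a differentiable
        # beamformer here, a stage omitted from this sketch.
        return torch.pi * torch.tanh(self.localizer(multichannel))

angles = DirectionalFrontEnd()(torch.randn(2, 4, 16000))  # radians
```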
arXiv Detail & Related papers (2020-10-30T20:26:28Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
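A compact sketch of the joint setup, with all sizes assumed: a shared encoder over the ASR transcript feeds a token-level correction head and an utterance-level intent head, and the two cross-entropy losses are summed.

```python
import torch
import torch.nn as nn

class JointCorrectionLU(nn.Module):
    def __init__(self, vocab=1000, dim=256, n_intents=20):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.correct_head = nn.Linear(dim, vocab)     # corrected tokens
        self.intent_head = nn.Linear(dim, n_intents)  # LU label

    def forward(self, tokens):
        h, last = self.encoder(self.embed(tokens))
        return self.correct_head(h), self.intent_head(last[-1])

model = JointCorrectionLU()
asr_tokens = torch.randint(0, 1000, (8, 20))          # noisy ASR output
corr_logits, intent_logits = model(asr_tokens)
corr_loss = nn.CrossEntropyLoss()(corr_logits.transpose(1, 2),
                                  torch.randint(0, 1000, (8, 20)))
intent_loss = nn.CrossEntropyLoss()(intent_logits,
                                    torch.randint(0, 20, (8,)))
(corr_loss + intent_loss).backward()
```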
arXiv Detail & Related papers (2020-01-28T22:09:25Z)