A Comparative Study on Speaker-attributed Automatic Speech Recognition
in Multi-party Meetings
- URL: http://arxiv.org/abs/2203.16834v2
- Date: Fri, 1 Apr 2022 04:24:48 GMT
- Title: A Comparative Study on Speaker-attributed Automatic Speech Recognition
in Multi-party Meetings
- Authors: Fan Yu, Zhihao Du, Shiliang Zhang, Yuxiao Lin, Lei Xie
- Abstract summary: Three approaches are evaluated for speaker-attributed automatic speech recognition (SA-ASR) in a meeting scenario.
The WD-SOT approach achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER).
The TS-ASR approach also outperforms the FD-SOT approach, bringing a 16.5% relative average SD-CER reduction.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we conduct a comparative study on speaker-attributed automatic
speech recognition (SA-ASR) in the multi-party meeting scenario, a topic with
increasing attention in meeting rich transcription. Specifically, three
approaches are evaluated in this study. The first approach, FD-SOT, consists of
a frame-level diarization model to identify speakers and a multi-talker ASR to
recognize utterances. The speaker-attributed transcriptions are obtained by
aligning the diarization results and recognized hypotheses. However, such an
alignment strategy may suffer from erroneous timestamps due to the modular
independence, severely hindering the model performance. Therefore, we propose
the second approach, WD-SOT, to address alignment errors by introducing a
word-level diarization model, which can get rid of such timestamp alignment
dependency. To further mitigate the alignment issues, we propose the third
approach, TS-ASR, which trains a target-speaker separation module and an ASR
module jointly. By comparing various strategies for each SA-ASR approach,
experimental results on a real meeting scenario corpus, AliMeeting, reveal that
the WD-SOT approach achieves 10.7% relative reduction on averaged
speaker-dependent character error rate (SD-CER), compared with the FD-SOT
approach. In addition, the TS-ASR approach also outperforms the FD-SOT approach
and brings 16.5% relative average SD-CER reduction.
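The FD-SOT alignment step described in the abstract can be illustrated with a minimal sketch: each recognized utterance is attributed to the diarization speaker whose segments overlap it most in time. The `Segment` and `Hypothesis` structures and the overlap-argmax rule below are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One frame-level diarization result: a time span labeled with a speaker."""
    start: float
    end: float
    speaker: str

@dataclass
class Hypothesis:
    """One multi-talker ASR hypothesis with estimated start/end timestamps."""
    start: float
    end: float
    text: str

def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the temporal intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def attribute_speakers(segments, hypotheses):
    """Assign each hypothesis to the speaker with the largest total overlap."""
    results = []
    for hyp in hypotheses:
        totals = {}
        for seg in segments:
            ov = overlap(hyp.start, hyp.end, seg.start, seg.end)
            if ov > 0:
                totals[seg.speaker] = totals.get(seg.speaker, 0.0) + ov
        speaker = max(totals, key=totals.get) if totals else None
        results.append((speaker, hyp.text))
    return results
```

This sketch also makes the paper's criticism concrete: if either module produces erroneous timestamps, the overlap argmax flips to the wrong speaker, which is exactly the modular-independence failure that the WD-SOT and TS-ASR approaches are designed to avoid.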
Related papers
- MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction
We introduce a novel multi-modal fusion method to learn shared representations across modalities.
Experimental results indicate that MF-AED-AEC significantly outperforms the baseline model by a margin of 4.1%.
arXiv Detail & Related papers (2024-01-24T06:55:55Z)
- Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator
Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization.
We propose a cost-effective method to convert a single-talker automatic speech recognition system into a multi-talker one.
We incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters.
arXiv Detail & Related papers (2023-05-25T17:18:37Z)
- Cross-utterance ASR Rescoring with Graph-based Label Propagation
We propose a novel approach for ASR N-best hypothesis rescoring with graph-based label propagation.
In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information.
arXiv Detail & Related papers (2023-03-27T12:08:05Z)
- Factual Consistency Oriented Speech Recognition
The proposed framework optimizes the ASR model to maximize an expected factual consistency score between ASR hypotheses and ground-truth transcriptions.
It is shown that training the ASR models with the proposed framework improves the speech summarization quality as measured by the factual consistency of meeting conversation summaries.
arXiv Detail & Related papers (2023-02-24T00:01:41Z)
- Audio-visual Multi-channel Speech Separation, Dereverberation and Recognition
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization
This paper proposes a new paradigm for handling far-field multi-speaker data in an end-to-end neural network manner, called directional automatic speech recognition (D-ASR).
In D-ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance.
arXiv Detail & Related papers (2020-10-30T20:26:28Z)
- Continuous Speech Separation with Conformer
We use Transformer and Conformer architectures in lieu of recurrent neural networks in the separation system.
We believe capturing global information with the self-attention-based method is crucial for speech separation.
arXiv Detail & Related papers (2020-08-13T09:36:05Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.