AG-LSEC: Audio Grounded Lexical Speaker Error Correction
- URL: http://arxiv.org/abs/2406.17266v1
- Date: Tue, 25 Jun 2024 04:20:49 GMT
- Title: AG-LSEC: Audio Grounded Lexical Speaker Error Correction
- Authors: Rohit Paturi, Xiang Li, Sundararajan Srinivasan
- Abstract summary: Speaker Diarization (SD) systems are typically audio-based and operate independently of the ASR system in traditional speech transcription pipelines.
We propose to enhance and acoustically ground the Lexical Speaker Error Correction (LSEC) system with speaker scores directly derived from the existing SD pipeline.
This approach achieves significant relative WDER reductions in the range of 25-40% over the audio-based SD-ASR system and beats the LSEC system by 15-25% relative on the RT03-CTS, Callhome American English, and Fisher datasets.
- Score: 9.54540722574194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speaker Diarization (SD) systems are typically audio-based and operate independently of the ASR system in traditional speech transcription pipelines. As a result, they can produce speaker errors arising from SD and/or SD-ASR reconciliation, especially around speaker turns and regions of speech overlap. To reduce these errors, a Lexical Speaker Error Correction (LSEC) system, in which an external language model provides lexical information to correct speaker errors, was recently proposed. Although this approach achieves good Word Diarization Error Rate (WDER) improvements, it does not use any additional acoustic information and is prone to miscorrections. In this paper, we propose to enhance and acoustically ground the LSEC system with speaker scores derived directly from the existing SD pipeline. This approach achieves significant relative WDER reductions in the range of 25-40% over the audio-based SD-ASR system and beats the LSEC system by 15-25% relative on the RT03-CTS, Callhome American English, and Fisher datasets.
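To make the metric above concrete: WDER (Word Diarization Error Rate) is, roughly, the fraction of recognized words attributed to the wrong speaker. The following is a minimal sketch, not the paper's scoring implementation; it assumes a one-to-one word alignment between reference and hypothesis is already available (the full metric in the literature also accounts for ASR substitutions), and the example labels are hypothetical:

```python
def wder(ref_speakers, hyp_speakers):
    """Simplified Word Diarization Error Rate: the fraction of
    aligned words whose hypothesis speaker label differs from
    the reference label. Assumes the two word sequences are
    already aligned one-to-one."""
    assert len(ref_speakers) == len(hyp_speakers)
    wrong = sum(r != h for r, h in zip(ref_speakers, hyp_speakers))
    return wrong / len(ref_speakers)

# Hypothetical 8-word exchange with one error near a speaker turn,
# the region where the abstract notes such errors concentrate.
ref = ["A", "A", "A", "B", "B", "B", "A", "A"]
hyp = ["A", "A", "B", "B", "B", "B", "A", "A"]
print(wder(ref, hyp))  # 1 of 8 words mislabeled -> 0.125
```

A second-pass correction system like LSEC aims to flip mislabeled words such as the third one back to the reference speaker, directly lowering this ratio.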
Related papers
- SEAL: Speaker Error Correction using Acoustic-conditioned Large Language Models [15.098665255729507]
We introduce a novel acoustic conditioning approach to provide more fine-grained information from the acoustic diarizer to the LLM.
Our approach significantly reduces the speaker error rates by 24-43% across Fisher, Callhome, and RT03-CTS datasets.
arXiv Detail & Related papers (2025-01-14T20:24:12Z)
- MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models [59.80042864360884]
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately.
This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions.
arXiv Detail & Related papers (2024-11-27T09:01:08Z)
- Speaker Tagging Correction With Non-Autoregressive Language Models [0.0]
We propose a speaker tagging correction system based on a non-autoregressive language model.
We show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets.
arXiv Detail & Related papers (2024-08-30T11:02:17Z)
- Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs).
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods.
Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z)
- Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition [52.624909026294105]
We propose a non-autoregressive speech error correction method.
A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses.
The proposed system reduces the error rate by 21% compared with the ASR model.
arXiv Detail & Related papers (2024-06-29T17:56:28Z)
- Convoifilter: A case study of doing cocktail party speech recognition [59.80042864360884]
The model can decrease ASR's word error rate (WER) from 80% to 26.4% through this approach.
We openly share our pre-trained model at hf.co/nguyenvulebinh/voice-filter to foster further research.
arXiv Detail & Related papers (2023-08-22T12:09:30Z)
- Lexical Speaker Error Correction: Leveraging Language Models for Speaker Diarization Error Correction [4.409889336732851]
Speaker diarization (SD) is typically used with an automatic speech recognition (ASR) system to ascribe speaker labels to recognized words.
This approach can lead to speaker errors especially around speaker turns and regions of speaker overlap.
We propose a novel second-pass speaker error correction system using lexical information.
arXiv Detail & Related papers (2023-06-15T17:47:41Z)
- PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise.
We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
arXiv Detail & Related papers (2021-10-03T07:05:29Z)
- Segmenting Subtitles for Correcting ASR Segmentation Errors [11.854481771567503]
We propose a model for correcting the acoustic segmentation of ASR models for low-resource languages.
We train a neural tagging model for correcting ASR acoustic segmentation and show that it improves downstream performance.
arXiv Detail & Related papers (2021-04-16T03:04:10Z)
- End-to-End Speaker-Attributed ASR with Transformer [41.7739129773237]
This paper presents an end-to-end speaker-attributed automatic speech recognition system.
It jointly performs speaker counting, speech recognition and speaker identification for monaural multi-talker audio.
arXiv Detail & Related papers (2021-04-05T19:54:15Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
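Several entries above quote error-rate improvements both as absolute drops and as relative reductions (e.g. Convoifilter's WER falling from 80% to 26.4%). The two figures are linked by a simple formula; a quick sketch of the arithmetic:

```python
def relative_reduction(baseline, improved):
    """Relative error-rate reduction: the absolute drop
    expressed as a fraction of the baseline rate."""
    return (baseline - improved) / baseline

# Convoifilter figures quoted in the list above: WER 80% -> 26.4%,
# i.e. an absolute drop of 53.6 points.
print(round(relative_reduction(0.80, 0.264), 2))  # -> 0.67, a 67% relative reduction
```

This is why a seemingly modest absolute reduction (such as the 6.81% absolute WER drop quoted for the audio-visual system) can correspond to a much larger relative figure when the baseline error rate is low.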
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.