Interactive Real-Time Speaker Diarization Correction with Human Feedback
- URL: http://arxiv.org/abs/2509.18377v1
- Date: Mon, 22 Sep 2025 20:01:20 GMT
- Title: Interactive Real-Time Speaker Diarization Correction with Human Feedback
- Authors: Xinlu He, Yiwen Guan, Badrivishal Paurana, Zilin Dai, Jacob Whitehill
- Abstract summary: We propose an LLM-assisted speaker diarization correction system that lets users fix speaker attribution errors in real time. Our system substantially reduces DER by 9.92% and speaker confusion error by 44.23%.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most automatic speech processing systems operate in "open loop" mode without user feedback about who said what; yet, human-in-the-loop workflows can potentially enable higher accuracy. We propose an LLM-assisted speaker diarization correction system that lets users fix speaker attribution errors in real time. The pipeline performs streaming ASR and diarization, uses an LLM to deliver concise summaries to the users, and accepts brief verbal feedback that is immediately incorporated without disrupting interactions. Moreover, we develop techniques to make the workflow more effective: First, a split-when-merged (SWM) technique detects and splits multi-speaker segments that the ASR erroneously attributes to just a single speaker. Second, online speaker enrollments are collected based on users' diarization corrections, thus helping to prevent speaker diarization errors from occurring in the future. LLM-driven simulations on the AMI test set indicate that our system substantially reduces DER by 9.92% and speaker confusion error by 44.23%. We further analyze correction efficacy under different settings, including summary versus full-transcript display, limits on the number of online enrollments, and correction frequency.
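For reference, the diarization error rate (DER) cited in the abstract decomposes into missed speech, false alarm, and speaker confusion. The following is a minimal frame-level sketch of that decomposition (hypothetical labels, with `None` marking non-speech), not the paper's evaluation code, which assumes an optimal speaker mapping has already been applied:

```python
def der_components(ref, hyp):
    """Frame-level DER decomposition.

    ref, hyp: equal-length lists of per-frame speaker labels,
    with None marking non-speech. Assumes hypothesis speakers
    have already been optimally mapped to reference speakers.
    """
    assert len(ref) == len(hyp)
    missed = falarm = confusion = speech = 0
    for r, h in zip(ref, hyp):
        if r is not None:
            speech += 1
            if h is None:
                missed += 1      # reference speech, no hypothesis speech
            elif h != r:
                confusion += 1   # speech detected but wrong speaker
        elif h is not None:
            falarm += 1          # hypothesis speech where there is none
    der = (missed + falarm + confusion) / max(speech, 1)
    return der, missed, falarm, confusion

# Example: 1 confused frame and 1 missed frame over 4 speech frames
ref = ["A", "A", "B", "B", None]
hyp = ["A", "B", "B", None, None]
print(der_components(ref, hyp))  # (0.5, 1, 0, 1)
```

The paper's reported reductions target the confusion term in particular, which is the component that user corrections and online enrollments can directly address.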
Related papers
- Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage [66.67531241554546]
End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines. We introduce the first approach to extend tool use directly into speech-in speech-out systems. We propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech.
arXiv Detail & Related papers (2025-10-02T14:18:20Z)
- Towards Low-Latency Tracking of Multiple Speakers With Short-Context Speaker Embeddings [52.985061676464554]
We propose a Knowledge Distillation based training approach for short-context speaker embedding extraction. We leverage the spatial information of the speaker of interest using beamforming to reduce overlap. Results demonstrate that our models are effective at short-context embedding extraction and more robust to overlap.
arXiv Detail & Related papers (2025-08-18T11:32:13Z)
- SEAL: Speaker Error Correction using Acoustic-conditioned Large Language Models [15.098665255729507]
We introduce a novel acoustic conditioning approach to provide more fine-grained information from the acoustic diarizer to the LLM. Our approach significantly reduces the speaker error rates by 24-43% across Fisher, Callhome, and RT03-CTS datasets.
arXiv Detail & Related papers (2025-01-14T20:24:12Z)
- Speaker Tagging Correction With Non-Autoregressive Language Models [0.0]
We propose a speaker tagging correction system based on a non-autoregressive language model.
We show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets.
arXiv Detail & Related papers (2024-08-30T11:02:17Z)
- Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs).
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods.
Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z)
- Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition [52.624909026294105]
We propose a non-autoregressive speech error correction method.
A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses.
The proposed system reduces the error rate by 21% compared with the ASR model.
arXiv Detail & Related papers (2024-06-29T17:56:28Z)
- AG-LSEC: Audio Grounded Lexical Speaker Error Correction [9.54540722574194]
Speaker Diarization (SD) systems are typically audio-based and operate independently of the ASR system in traditional speech transcription pipelines.
We propose to enhance and acoustically ground the Lexical Speaker Error Correction (LSEC) system with speaker scores directly derived from the existing SD pipeline.
This approach achieves significant relative WDER reductions in the range of 25-40% over the audio-based SD and ASR system, and beats the LSEC system by 15-25% relative on the RT03-CTS, Callhome American English, and Fisher datasets.
arXiv Detail & Related papers (2024-06-25T04:20:49Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The benefit of video input is consistently demonstrated in mask-based MVDR speech separation and in DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on overlapped and reverberant mixture speech data constructed using simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- Lexical Speaker Error Correction: Leveraging Language Models for Speaker Diarization Error Correction [4.409889336732851]
Speaker diarization (SD) is typically used with an automatic speech recognition (ASR) system to ascribe speaker labels to recognized words.
This approach can lead to speaker errors, especially around speaker turns and regions of speaker overlap.
We propose a novel second-pass speaker error correction system using lexical information.
arXiv Detail & Related papers (2023-06-15T17:47:41Z)
- BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR [54.23941663326509]
Frequent speaker changes can make speaker change prediction difficult.
We propose boundary-aware serialized output training (BA-SOT).
Compared to original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%.
arXiv Detail & Related papers (2023-05-23T06:08:13Z)
- Cross-Modal ASR Post-Processing System for Error Correction and Utterance Rejection [25.940199825317073]
We propose a cross-modal post-processing system for speech recognizers.
It fuses acoustic features and textual features from different modalities.
It jointly trains a confidence estimator and an error corrector in a multi-task learning fashion.
arXiv Detail & Related papers (2022-01-10T12:29:55Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
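Several of the papers above report word diarization error rate (WDER). As a rough illustration, the core quantity is the fraction of words attributed to the wrong speaker; this minimal sketch assumes reference and hypothesis words have already been aligned one-to-one (the published metric also accounts for ASR substitutions):

```python
def wder(ref_speakers, hyp_speakers):
    """Fraction of aligned words whose hypothesis speaker label
    disagrees with the reference speaker label.

    ref_speakers, hyp_speakers: equal-length lists of speaker
    labels, one per aligned word.
    """
    assert len(ref_speakers) == len(hyp_speakers)
    wrong = sum(r != h for r, h in zip(ref_speakers, hyp_speakers))
    return wrong / max(len(ref_speakers), 1)

# Example: 2 of 5 aligned words carry the wrong speaker label
print(wder(["A", "A", "B", "B", "B"],
           ["A", "B", "B", "A", "B"]))  # 0.4
```

Unlike the time-based DER, WDER is computed over words, which makes it a natural target for the lexical and LLM-based correction systems listed here.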
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.