LLM-based speaker diarization correction: A generalizable approach
- URL: http://arxiv.org/abs/2406.04927v2
- Date: Fri, 13 Sep 2024 20:42:20 GMT
- Title: LLM-based speaker diarization correction: A generalizable approach
- Authors: Georgios Efstathiadis, Vijay Yadav, Anzar Abbas
- Abstract summary: We investigate the use of large language models (LLMs) for diarization correction as a post-processing step.
The ability of the models to improve diarization accuracy in a holdout dataset from the Fisher corpus as well as an independent dataset was measured.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant developments in diarization methods, diarization accuracy remains an issue. Here, we investigate the use of large language models (LLMs) for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of transcribed conversations. The ability of the models to improve diarization accuracy in a holdout dataset from the Fisher corpus as well as an independent dataset was measured. We report that fine-tuned LLMs can markedly improve diarization accuracy. However, model performance is constrained to transcripts produced using the same ASR tool as the transcripts used for fine-tuning, limiting generalizability. To address this constraint, an ensemble model was developed by combining weights from three separate models, each fine-tuned using transcripts from a different ASR tool. The ensemble model demonstrated better overall performance than each of the ASR-specific models, suggesting that a generalizable and ASR-agnostic approach may be achievable. We have made the weights of these models publicly available on HuggingFace at https://huggingface.co/bklynhlth.
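The abstract describes the ensemble as a combination of the weights of three ASR-specific fine-tuned models. As a rough illustration of how such weight-space merging is commonly done, the sketch below averages the parameters of three checkpoints element-wise; the uniform average, the causal-LM loading, and the repository names are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch of weight-space ensembling, assuming a uniform element-wise
# average of three ASR-specific fine-tuned checkpoints. The repository names
# are hypothetical placeholders; the released weights live under
# https://huggingface.co/bklynhlth and the paper's actual merging recipe may differ.
import torch
from transformers import AutoModelForCausalLM

CHECKPOINTS = [
    "bklynhlth/diarization-correction-asr1",  # hypothetical repo names,
    "bklynhlth/diarization-correction-asr2",  # one per ASR tool used
    "bklynhlth/diarization-correction-asr3",  # for fine-tuning
]

# Load the three fine-tuned models (same base architecture, different fine-tuning data).
models = [AutoModelForCausalLM.from_pretrained(ckpt) for ckpt in CHECKPOINTS]
state_dicts = [m.state_dict() for m in models]

# Start from the first checkpoint and overwrite its floating-point parameters
# with the element-wise mean across all three checkpoints.
merged = models[0]
merged_state = merged.state_dict()
for name, tensor in merged_state.items():
    if tensor.is_floating_point():
        merged_state[name] = torch.stack([sd[name] for sd in state_dicts]).mean(dim=0)
merged.load_state_dict(merged_state)

# The merged model can then be prompted with a speaker-labelled ASR transcript
# and asked to rewrite the speaker attributions, like any other causal LM.
merged.save_pretrained("diarization-correction-ensemble")
```

Uniform averaging is only one plausible combination scheme; weighted or per-layer averages are equally compatible with the description in the abstract, and a merged model has the practical advantage of a single forward pass at inference time rather than running three models.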
Related papers
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs).
Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO).
We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z)
- Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation [73.9145653659403]
We show that Generative Error Correction models struggle to generalize beyond the specific types of errors encountered during training.
We propose DARAG, a novel approach designed to improve GEC for ASR in in-domain (ID) and out-of-domain (OOD) scenarios.
Our approach is simple, scalable, and both domain- and language-agnostic.
arXiv Detail & Related papers (2024-10-17T04:00:29Z)
- LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation [15.520180125182756]
Recent advancements in integrating speech information into large language models (LLMs) have significantly improved automatic speech recognition (ASR) accuracy.
Existing methods are often constrained by the capabilities of the speech encoders under varied acoustic conditions, such as accents.
We propose LA-RAG, a novel Retrieval-Augmented Generation (RAG) paradigm for LLM-based ASR.
arXiv Detail & Related papers (2024-09-13T07:28:47Z)
- Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs).
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods.
Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z)
- Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models [0.8399688944263842]
Large Language Models (LLMs) have the capability to understand and generate human-like text from input queries.
This study extends this concept to the integration of LLMs within Retrieval-Augmented Generation (RAG) pipelines.
We evaluate the impact of fine-tuning on the LLMs' capacity for data extraction and contextual understanding.
arXiv Detail & Related papers (2024-06-17T04:35:17Z)
- Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding [1.07288078404291]
We propose a natural language understanding approach based on Automatic Speech Recognition (ASR).
We improve a noisy-channel model to handle transcription inconsistencies caused by ASR errors.
Experiments on four benchmark datasets show that Contrastive and Consistency Learning (CCL) outperforms existing methods.
arXiv Detail & Related papers (2024-05-23T23:10:23Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z)
- On-the-fly Text Retrieval for End-to-End ASR Adaptation [9.304386210911822]
We propose augmenting a transducer-based ASR model with a retrieval language model, which retrieves from an external text corpus plausible completions for a partial ASR hypothesis.
Our experiments show that the proposed model significantly improves the performance of a transducer baseline on a pair of question-answering datasets.
arXiv Detail & Related papers (2023-03-20T08:54:40Z)
- Scene Text Recognition with Permuted Autoregressive Sequence Models [15.118059441365343]
Context-aware STR methods typically use internal autoregressive (AR) language models (LMs).
Our method, PARSeq, learns an ensemble of internal AR LMs with shared weights using Permutation Language Modeling.
It achieves context-free non-AR and context-aware AR inference, and iterative refinement using bidirectional context.
arXiv Detail & Related papers (2022-07-14T14:51:50Z)
- Automatic Learning of Subword Dependent Model Scales [50.105894487730545]
We show that the model scales for a combination of an attention encoder-decoder acoustic model and a language model can be learned as effectively as with manual tuning.
We extend this approach to subword-dependent model scales, which could not be tuned manually, leading to a 7% improvement on LBS and 3% on SWB.
arXiv Detail & Related papers (2021-10-18T13:48:28Z)
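As a loose illustration of the learned model scales described in the last entry above, the sketch below combines acoustic-model and language-model log-probabilities with one learnable scale per subword; the log-linear form, the class name, and the initialization value are assumptions for illustration, not that paper's exact formulation.

```python
import torch
import torch.nn as nn

class SubwordScaledCombination(nn.Module):
    """Log-linear score combination with one learnable LM scale per subword."""

    def __init__(self, vocab_size: int, init_scale: float = 0.6):
        super().__init__()
        # One scale per subword, optimized jointly with the rest of the system
        # instead of being tuned by hand (init_scale is an assumed starting point).
        self.lm_scale = nn.Parameter(torch.full((vocab_size,), init_scale))

    def forward(self, am_log_probs: torch.Tensor, lm_log_probs: torch.Tensor,
                token_ids: torch.Tensor) -> torch.Tensor:
        # am_log_probs, lm_log_probs: (batch, seq) per-token log-probabilities.
        # token_ids: (batch, seq) subword indices used to look up their scales.
        return am_log_probs + self.lm_scale[token_ids] * lm_log_probs
```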
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.