Speech Diarization and ASR with GMM
- URL: http://arxiv.org/abs/2307.05637v1
- Date: Tue, 11 Jul 2023 09:25:39 GMT
- Title: Speech Diarization and ASR with GMM
- Authors: Aayush Kumar Sharma, Vineet Bhavikatti, Amogh Nidawani, Dr. Siddappaji, Sanath P, Dr. Geetishree Mishra
- Abstract summary: Speech diarization involves the separation of individual speakers within an audio stream.
ASR entails the conversion of an unknown speech waveform into a corresponding written transcription.
Our primary objective is to develop a model that minimizes the Word Error Rate (WER) metric during speech transcription.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this research paper, we delve into the topics of Speech Diarization and
Automatic Speech Recognition (ASR). Speech diarization involves the separation
of individual speakers within an audio stream. By employing the ASR transcript,
the diarization process aims to segregate each speaker's utterances, grouping
them based on their unique audio characteristics. On the other hand, Automatic
Speech Recognition refers to the capability of a machine or program to identify
and convert spoken words and phrases into a machine-readable format. In our
speech diarization approach, we utilize the Gaussian Mixture Model (GMM) to
represent speech segments. The inter-cluster distance is computed based on the
GMM parameters, and the distance threshold serves as the stopping criterion.
ASR entails the conversion of an unknown speech waveform into a corresponding
written transcription. The speech signal is analyzed with pitch-synchronous
algorithms that take the pitch frequency into account. Our primary objective
is to develop a model that minimizes the Word Error Rate (WER) metric during
speech transcription.
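To make the diarization step concrete, here is a minimal sketch of GMM-based agglomerative speaker clustering with a distance threshold as the stopping criterion, as described in the abstract. The MFCC-style segment features, the symmetric cross-likelihood distance, and the threshold value are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: GMM-based agglomerative speaker clustering.
# Feature extraction, distance formula and threshold are illustrative
# assumptions, not the paper's exact configuration.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(segment_features, n_components=4):
    """Fit a diagonal-covariance GMM to the feature frames of one segment."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(segment_features)
    return gmm

def gmm_distance(gmm_a, gmm_b, feats_a, feats_b):
    """Symmetric cross-likelihood distance between two segment GMMs
    (one common choice; the paper's inter-cluster distance may differ)."""
    return -(gmm_a.score(feats_b) + gmm_b.score(feats_a))

def cluster_segments(segments, threshold=50.0):
    """Greedy agglomerative merging; stop when the smallest inter-cluster
    distance exceeds the threshold (the stopping criterion above)."""
    clusters = [(feats, fit_gmm(feats)) for feats in segments]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = gmm_distance(clusters[i][1], clusters[j][1],
                                 clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:   # distance threshold acts as stopping criterion
            break
        _, i, j = best
        merged = np.vstack([clusters[i][0], clusters[j][0]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((merged, fit_gmm(merged)))
    return clusters
```

Each surviving cluster is then taken to correspond to one speaker; the threshold trades off over- and under-clustering of speakers.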
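Since the stated objective is minimizing WER, the following sketch shows the standard word-level edit-distance computation of WER between a reference transcript and an ASR hypothesis; the example strings are illustrative only.

```python
# Minimal sketch: word error rate (WER) via word-level edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution and one deletion over four reference words -> 0.5
print(wer("the cat sat down", "the bat sat"))
```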
Related papers
- Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models [21.85677682584916]
The paper addresses speculative speech recognition (SSR): a model that performs SSR by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM).
arXiv Detail & Related papers (2024-07-05T16:52:55Z)
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition [18.50957174600796]
A common solution to automatic speech recognition (ASR) of overlapping speakers is to separate the speech and then perform ASR on the separated signals.
Currently, the separator produces artefacts which often degrade ASR performance.
This paper proposes a transcription-free method for joint training using only audio signals.
arXiv Detail & Related papers (2024-06-13T08:20:58Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech [10.291482850329892]
We propose a speaker-conditioned separator trained on speaker embeddings extracted directly from the mixed signal.
We achieve significant improvements in word error rate (WER) on real conversational data without the need for an additional re-stitching step.
arXiv Detail & Related papers (2021-12-10T23:07:48Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on LibrispeechMix, a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.