Bi-LSTM Scoring Based Similarity Measurement with Agglomerative
Hierarchical Clustering (AHC) for Speaker Diarization
- URL: http://arxiv.org/abs/2205.09709v1
- Date: Thu, 19 May 2022 17:20:51 GMT
- Title: Bi-LSTM Scoring Based Similarity Measurement with Agglomerative
Hierarchical Clustering (AHC) for Speaker Diarization
- Authors: Siddharth S. Nijhawan and Homayoon Beigi
- Abstract summary: A typical conversation between two speakers consists of segments where their voices overlap, where they interrupt each other, or where they pause between sentences.
Recent advancements in diarization technology leverage neural network-based approaches to improve speaker diarization systems.
We propose a Bi-directional Long Short-Term Memory (Bi-LSTM) network for estimating the elements of the similarity matrix.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The majority of speech signals across different scenarios are never
available as well-defined audio segments containing only a single speaker. A
typical conversation between two speakers consists of segments where their
voices overlap, where they interrupt each other, or where they pause between
sentences. Recent advancements in diarization technology leverage neural
network-based approaches to improve multiple subsystems of the speaker
diarization pipeline, such as extracting segment-wise embedding features and
detecting speaker changes during a conversation. However, to identify speakers
through clustering, models depend on methodologies such as Probabilistic Linear
Discriminant Analysis (PLDA) to generate a similarity measure between two
segments extracted from a given conversational audio. Since these algorithms
ignore the temporal structure of conversations, they tend to achieve a higher
Diarization Error Rate (DER), leading to misdetections in both speaker
identification and change detection. Therefore, to compare the similarity of
two speech segments both independently and sequentially, we propose a
Bi-directional Long Short-Term Memory (Bi-LSTM) network for estimating the
elements of the similarity matrix. Once the similarity matrix is generated,
Agglomerative Hierarchical Clustering (AHC) with a distance threshold is
applied to assign segments to speakers. Performance is evaluated using the
Diarization Error Rate (DER) metric. The proposed model achieves a DER of
34.80% on a test set of audio samples derived from the ICSI Meeting Corpus,
compared to 39.90% for a traditional PLDA-based similarity measurement
mechanism.
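A minimal sketch of the scoring-plus-clustering pipeline described in the
abstract, assuming PyTorch and SciPy. The embedding size, hidden size, the
row-wise output head, and the 0.5 distance threshold are illustrative
assumptions rather than the authors' exact configuration.

import numpy as np
import torch
import torch.nn as nn
from scipy.cluster.hierarchy import fcluster, linkage


class BiLSTMSimilarityScorer(nn.Module):
    """Estimates one row of the N x N similarity matrix per segment."""

    def __init__(self, embed_dim=128, hidden_dim=256, max_segments=400):
        super().__init__()
        self.blstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Maps each Bi-LSTM state to similarity scores against all segments.
        self.head = nn.Linear(2 * hidden_dim, max_segments)

    def forward(self, segment_embeddings):
        # segment_embeddings: (batch, num_segments, embed_dim)
        outputs, _ = self.blstm(segment_embeddings)
        scores = torch.sigmoid(self.head(outputs))  # (batch, N, max_segments)
        n = segment_embeddings.size(1)
        return scores[:, :, :n]                     # keep the N x N block


def cluster_segments(similarity_matrix, threshold=0.5):
    """Agglomerative hierarchical clustering cut at a distance threshold."""
    sim = (similarity_matrix + similarity_matrix.T) / 2.0  # symmetrise
    dist = 1.0 - sim                                       # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    condensed = dist[np.triu_indices_from(dist, k=1)]      # condensed form for linkage
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=threshold, criterion="distance")


if __name__ == "__main__":
    torch.manual_seed(0)
    embeddings = torch.randn(1, 20, 128)  # 20 dummy segment embeddings
    scorer = BiLSTMSimilarityScorer()
    with torch.no_grad():
        sim = scorer(embeddings)[0].numpy()
    labels = cluster_segments(sim, threshold=0.5)
    print("speaker label per segment:", labels)

In a setup like this, the AHC threshold controls how aggressively segments are
merged into speakers, and the resulting labels are scored with DER, which sums
missed speech, false alarm, and speaker-confusion time over the total reference
speech time.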
Related papers
- Multi-microphone Automatic Speech Segmentation in Meetings Based on
Circular Harmonics Features [0.0]
We propose a new set of spatial features based on direction-of-arrival estimations in the circular harmonic domain (CH-DOA).
Experiments on the AMI meeting corpus show that CH-DOA can improve the segmentation while being robust in the case of deactivated microphones.
arXiv Detail & Related papers (2023-06-07T09:09:00Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- VarArray: Array-Geometry-Agnostic Continuous Speech Separation [26.938313513582642]
Continuous speech separation using a microphone array was shown to be promising in dealing with the speech overlap problem in natural conversation transcription.
This paper proposes VarArray, an array-geometry-agnostic speech separation neural network model.
arXiv Detail & Related papers (2021-10-12T05:31:46Z)
- Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number
of Speakers using End-to-End Speaker-Attributed ASR [44.181755224118696]
Transcribe-to-Diarize is a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) system.
The proposed method achieves significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown.
arXiv Detail & Related papers (2021-10-07T02:48:49Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global
Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Content-Aware Speaker Embeddings for Speaker Diarisation [3.6398652091809987]
The content-aware speaker embeddings (CASE) approach is proposed.
CASE factorises automatic speech recognition (ASR) from speaker recognition to focus on modelling speaker characteristics.
CASE achieved a 17.8% relative speaker error rate reduction over conventional methods.
arXiv Detail & Related papers (2021-02-12T12:02:03Z)
- End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z)
- Single channel voice separation for unknown number of speakers under
reverberant and noisy settings [106.48335929548875]
We present a unified network for voice separation of an unknown number of speakers.
The proposed approach is composed of several separation heads optimized together with a speaker classification branch.
We present a new noisy and reverberant dataset of up to five different speakers speaking simultaneously.
arXiv Detail & Related papers (2020-11-04T14:59:14Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reductions on overlapped speech constructed using either simulation or replay of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for
Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker on each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)