Speaker Embedding-aware Neural Diarization for Flexible Number of
Speakers with Textual Information
- URL: http://arxiv.org/abs/2111.13694v1
- Date: Sun, 28 Nov 2021 12:51:04 GMT
- Authors: Zhihao Du, Shiliang Zhang, Siqi Zheng, Weilong Huang, Ming Lei
- Abstract summary: We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves lower diarization error rate than the target-speaker voice activity detection.
- Score: 55.75018546938499
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Overlapping speech diarization is always treated as a multi-label
classification problem. In this paper, we reformulate this task as a
single-label prediction problem by encoding the multi-speaker labels with power
set. Specifically, we propose the speaker embedding-aware neural diarization
(SEND) method, which predicts the power set encoded labels according to the
similarities between speech features and given speaker embeddings. Our method
is further extended and integrated with downstream tasks by utilizing the
textual information, which has not been well studied in previous literature.
The experimental results show that our method achieves lower diarization error
rate than the target-speaker voice activity detection. When textual information
is involved, the diarization errors can be further reduced. For the real
meeting scenario, our method can achieve 34.11% relative improvement compared
with the Bayesian hidden Markov model based clustering algorithm.
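The power set reformulation described in the abstract can be illustrated with a small sketch. It assumes a fixed number of candidate speakers and maps each frame's multi-hot activity vector to one integer class by reading it as a binary number. The function names and the bit ordering are illustrative assumptions; the abstract does not specify the exact mapping used in SEND.

```python
def powerset_encode(active, num_speakers):
    """Map a multi-hot speaker activity vector to a single class index.

    Speaker i contributes 2**i when active, so every subset of speakers
    (including silence and overlaps) gets its own class in [0, 2**num_speakers).
    """
    if len(active) != num_speakers:
        raise ValueError("activity vector length must equal num_speakers")
    return sum(int(bool(a)) << i for i, a in enumerate(active))


def powerset_decode(label, num_speakers):
    """Recover the multi-hot activity vector from a class index."""
    if not 0 <= label < 2 ** num_speakers:
        raise ValueError("label out of range for this number of speakers")
    return [(label >> i) & 1 for i in range(num_speakers)]


# Hypothetical frame-level activity for 3 speakers:
# silence, speaker 0 alone, speakers 0 and 1 overlapping, speaker 2 alone.
frames = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
labels = [powerset_encode(f, 3) for f in frames]
print(labels)
```

With this encoding a classifier predicts one of 2^N classes per frame instead of N independent binary outputs, so an overlap such as speakers 0 and 1 talking together is a single class rather than two simultaneous positive labels.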
Related papers
- Self-supervised Fine-tuning for Improved Content Representations by
Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Improving the Intent Classification accuracy in Noisy Environment [9.447108578893639]
In this paper, we investigate how environmental noise and related noise reduction techniques affect the intent classification task with end-to-end neural models.
For this task, the use of speech enhancement greatly improves the classification accuracy in noisy conditions.
arXiv Detail & Related papers (2023-03-12T06:11:44Z)
- Distant finetuning with discourse relations for stance classification [55.131676584455306]
We propose a new method to extract data with silver labels from raw text to finetune a model for stance classification.
We also propose a 3-stage training framework where the noise level in the data used for finetuning decreases over different stages.
Our approach ranks 1st among 26 competing teams in the stance classification track of the NLPCC 2021 shared task Argumentative Text Understanding for AI Debater.
arXiv Detail & Related papers (2022-04-27T04:24:35Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Improving End-to-End Contextual Speech Recognition with Fine-grained
Contextual Knowledge Selection [21.116123328330467]
This work focuses on mitigating confusion problems with fine-grained contextual knowledge selection (FineCoS).
We first apply phrase selection to narrow the range of phrase candidates, and then conduct token attention on the tokens in the selected phrase candidates.
We re-normalize the attention weights of most relevant phrases in inference to obtain more focused phrase-level contextual representations.
arXiv Detail & Related papers (2022-01-30T13:08:16Z)
- Integrating end-to-end neural and clustering-based diarization: Getting
the best of both worlds [71.36164750147827]
Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors.
End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network.
We propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
arXiv Detail & Related papers (2020-10-26T06:33:02Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
- End-to-End Neural Diarization: Reformulating Speaker Diarization as
Simple Multi-label Classification [45.38809571153867]
We propose the End-to-End Neural Diarization (EEND) in which a neural network directly outputs speaker diarization results.
By feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations.
arXiv Detail & Related papers (2020-02-24T14:53:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.