Speaker Diarization with Lexical Information
- URL: http://arxiv.org/abs/2004.06756v1
- Date: Mon, 13 Apr 2020 17:16:56 GMT
- Title: Speaker Diarization with Lexical Information
- Authors: Tae Jin Park, Kyu J. Han, Jing Huang, Xiaodong He, Bowen Zhou, Panayiotis Georgiou and Shrikanth Narayanan
- Abstract summary: This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
- Score: 59.983797884955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents a novel approach for speaker diarization to leverage
lexical information provided by automatic speech recognition. We propose a
speaker diarization system that can incorporate word-level speaker turn
probabilities with speaker embeddings into a speaker clustering process to
improve the overall diarization accuracy. To integrate lexical and acoustic
information in a comprehensive way during clustering, we introduce an adjacency
matrix integration for spectral clustering. Since words and word boundary
information for word-level speaker turn probability estimation are provided by
a speech recognition system, our proposed method works without any human
intervention for manual transcriptions. We show that the proposed method
improves diarization performance on various evaluation datasets compared to the
baseline diarization system using acoustic information only in speaker
embeddings.
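The abstract names the idea of integrating lexical and acoustic adjacency matrices for spectral clustering but gives no formulas. As a rough illustration only, here is a minimal NumPy sketch under several assumptions that are not taken from the paper: the acoustic affinity is clipped cosine similarity between segment embeddings, the lexical adjacency links consecutive segments with weight one minus a turn probability, and the two matrices are merged by a simple weighted sum. The function names, the `weight` parameter, and the toy data are all hypothetical.

```python
import numpy as np


def cosine_affinity(embeddings):
    """Acoustic affinity: cosine similarity between segments, negatives clipped to 0."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return np.clip(unit @ unit.T, 0.0, 1.0)


def lexical_adjacency(turn_probs):
    """Lexical adjacency between consecutive segments (assumed form).

    turn_probs[i] is a hypothetical probability that the speaker changes
    between segment i and segment i + 1; a low value gives a strong link.
    """
    n = len(turn_probs) + 1
    adj = np.zeros((n, n))
    for i, p in enumerate(turn_probs):
        adj[i, i + 1] = adj[i + 1, i] = 1.0 - p
    return adj


def integrate(acoustic, lexical, weight=0.5):
    """Weighted sum of the two matrices (an assumption, not the paper's rule)."""
    return (1.0 - weight) * acoustic + weight * lexical


def spectral_labels(affinity, n_speakers, n_iter=20):
    """Spectral clustering: normalized Laplacian eigenvectors plus a small k-means."""
    a = affinity.copy()
    np.fill_diagonal(a, 0.0)
    deg = a.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    lap = np.eye(len(a)) - d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)  # eigenvalues come back in ascending order
    feats = vecs[:, :n_speakers]
    feats = feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-12)
    # Deterministic farthest-point initialization for k-means.
    centers = [feats[0]]
    for _ in range(1, n_speakers):
        dist = np.min([np.sum((feats - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(feats[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq_dist = ((feats[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(sq_dist, axis=1)
        for k in range(n_speakers):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean(axis=0)
    return labels


# Toy data: three segments from one speaker followed by three from another.
embeddings = np.array([
    [1.0, 0.1], [0.9, 0.0], [1.0, -0.1],
    [0.1, 1.0], [0.0, 0.9], [-0.1, 1.0],
])
turn_probs = [0.1, 0.1, 0.9, 0.1, 0.1]  # high turn probability between segments 2 and 3

combined = integrate(cosine_affinity(embeddings), lexical_adjacency(turn_probs))
labels = spectral_labels(combined, n_speakers=2)
print(labels)
```

In this toy setup the lexical links reinforce the acoustic blocks, so the two speakers end up in separate clusters. In a real system the turn probabilities would come from a model running on ASR output, and the combination rule would need tuning.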
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into native to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings.
Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Content-Aware Speaker Embeddings for Speaker Diarisation [3.6398652091809987]
The content-aware speaker embeddings (CASE) approach is proposed.
CASE factorises automatic speech recognition (ASR) from speaker recognition to focus on modelling speaker characteristics.
CASE achieved a 17.8% relative speaker error rate reduction over conventional methods.
arXiv Detail & Related papers (2021-02-12T12:02:03Z)
- U-vectors: Generating clusterable speaker embedding from unlabeled data [0.0]
This paper introduces a speaker recognition strategy dealing with unlabeled data.
It generates clusterable embedding vectors from small fixed-size speech frames.
We conclude that the proposed approach achieves remarkable performance using pairwise architectures.
arXiv Detail & Related papers (2021-02-07T18:00:09Z)
- Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z)