Graph-based Label Propagation for Semi-Supervised Speaker Identification
- URL: http://arxiv.org/abs/2106.08207v1
- Date: Tue, 15 Jun 2021 15:10:33 GMT
- Title: Graph-based Label Propagation for Semi-Supervised Speaker Identification
- Authors: Long Chen, Venkatesh Ravichandran, Andreas Stolcke
- Abstract summary: We propose a graph-based semi-supervised learning approach for speaker identification in the household scenario.
We show that this approach makes effective use of unlabeled data and improves speaker identification accuracy compared to two state-of-the-art scoring methods.
- Score: 10.87690067963342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker identification in the household scenario (e.g., for smart speakers)
is typically based on only a few enrollment utterances but a much larger set of
unlabeled data, suggesting semi-supervised learning to improve speaker profiles.
We propose a graph-based semi-supervised learning approach for speaker
identification in the household scenario, to leverage the unlabeled speech
samples. In contrast to most work in speaker recognition, which focuses on
speaker-discriminative embeddings, this work focuses on speaker label inference
(scoring). Given a pre-trained embedding extractor, graph-based learning allows
us to integrate information about both labeled and unlabeled utterances.
Considering each utterance as a graph node, we represent pairwise utterance
similarity scores as edge weights. Graphs are constructed per household, and
speaker identities are propagated to unlabeled nodes to optimize a global
consistency criterion. We show in experiments on the VoxCeleb dataset that this
approach makes effective use of unlabeled data and improves speaker
identification accuracy compared to two state-of-the-art scoring methods as
well as their semi-supervised variants based on pseudo-labels.
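The graph construction and propagation described in the abstract can be illustrated with a minimal sketch. It assumes the classic "local and global consistency" closed form, F = (I - alpha S)^{-1} Y, with cosine similarity as edge weights; the abstract does not specify the authors' exact similarity scores, normalization, or hyperparameters, so the function name `propagate_labels`, the clipping of negative similarities, and `alpha=0.9` are illustrative assumptions rather than the paper's method.

```python
import numpy as np


def propagate_labels(embeddings, labels, alpha=0.9):
    """Propagate speaker labels over one household graph (illustrative sketch).

    embeddings: (n, d) array, one speaker embedding per utterance (graph node).
    labels:     (n,) int array, speaker index for enrollment utterances,
                -1 for unlabeled utterances.
    alpha:      propagation strength in (0, 1); an assumed value, not from the paper.
    """
    # Edge weights: cosine similarity between L2-normalized embeddings.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = np.clip(x @ x.T, 0.0, None)  # assumption: drop negative similarities
    np.fill_diagonal(w, 0.0)         # no self-loops

    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(w.sum(axis=1), 1e-12))
    s = w * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # One-hot seed matrix Y: rows of unlabeled nodes stay all-zero.
    labeled = labels >= 0
    y = np.zeros((len(labels), labels.max() + 1))
    y[labeled, labels[labeled]] = 1.0

    # Closed-form solution of the global consistency objective.
    f = np.linalg.solve(np.eye(len(labels)) - alpha * s, y)

    pred = f.argmax(axis=1)
    pred[labeled] = labels[labeled]  # keep enrollment labels fixed
    return pred


# Toy household: two speakers, one enrollment utterance each.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
true_speakers = np.array([0] * 5 + [1] * 5)
toy_embeddings = centers[true_speakers] + 0.05 * rng.standard_normal((10, 3))
toy_labels = np.full(10, -1)
toy_labels[0], toy_labels[5] = 0, 1
print(propagate_labels(toy_embeddings, toy_labels))
```

Solving the linear system directly is practical here because graphs are built per household and therefore stay small; an iterative update would be the usual alternative for larger graphs.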
Related papers
- Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling [21.82879779173242]
The lack of labeled data is a common challenge in speech classification tasks.
We propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method.
We evaluate our SSL framework on emotion recognition and dementia detection tasks.
arXiv Detail & Related papers (2024-09-25T13:51:19Z)
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z)
- Learning Semantic Correspondence with Sparse Annotations [66.37298464505261]
Finding dense semantic correspondence is a fundamental problem in computer vision.
We propose a teacher-student learning paradigm for generating dense pseudo-labels.
We also develop two novel strategies for denoising pseudo-labels.
arXiv Detail & Related papers (2022-08-15T02:24:18Z)
- Self-supervised Speaker Diarization [19.111219197011355]
This study proposes an entirely unsupervised deep-learning model for speaker diarization.
Speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker.
arXiv Detail & Related papers (2022-04-08T16:27:14Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts power-set-encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Speaker attribution with voice profiles by graph-based semi-supervised learning [29.042995008709916]
We propose to solve the speaker attribution problem by using graph-based semi-supervised learning methods.
A graph of speech segments is built for each session, on which segments from voice profiles are represented by labeled nodes and segments from test utterances are unlabeled nodes.
Speaker attribution then becomes a semi-supervised learning problem on graphs, to which two graph-based methods are applied: label propagation (LP) and graph neural networks (GNNs).
arXiv Detail & Related papers (2021-02-06T18:35:56Z)
- Leveraging speaker attribute information using multi task learning for speaker verification and diarization [33.60058873783114]
We propose a framework for making use of auxiliary label information, even when it is only available for speech corpora mismatched to the target application.
We show that by leveraging two additional forms of speaker attribute information, we improve the performance of our deep speaker embeddings for both verification and diarization tasks.
arXiv Detail & Related papers (2020-10-27T13:10:51Z)
- Graph Attention Networks for Speaker Verification [43.01058120303278]
This work presents a novel back-end framework for speaker verification using graph attention networks.
We first construct a graph using segment-wise speaker embeddings and then input these to graph attention networks.
After a few graph attention layers with residual connections, each node is projected into a one-dimensional space using an affine transform.
arXiv Detail & Related papers (2020-10-22T09:08:02Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.