Speaker attribution with voice profiles by graph-based semi-supervised
learning
- URL: http://arxiv.org/abs/2102.03634v1
- Date: Sat, 6 Feb 2021 18:35:56 GMT
- Title: Speaker attribution with voice profiles by graph-based semi-supervised
learning
- Authors: Jixuan Wang, Xiong Xiao, Jian Wu, Ranjani Ramamurthy, Frank Rudzicz,
Michael Brudno
- Abstract summary: We propose to solve the speaker attribution problem by using graph-based semi-supervised learning methods.
A graph of speech segments is built for each session, on which segments from voice profiles are represented by labeled nodes and segments from test utterances are unlabeled nodes.
Speaker attribution then becomes a semi-supervised learning problem on graphs, on which two graph-based methods are applied: label propagation (LP) and graph neural networks (GNNs).
- Score: 29.042995008709916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker attribution is required in many real-world applications, such as
meeting transcription, where speaker identity is assigned to each utterance
according to speaker voice profiles. In this paper, we propose to solve the
speaker attribution problem by using graph-based semi-supervised learning
methods. A graph of speech segments is built for each session, on which
segments from voice profiles are represented by labeled nodes while segments
from test utterances are unlabeled nodes. The weight of each edge is computed
from the similarity between the pretrained speaker embeddings of the two
speech segments it connects. Speaker attribution then becomes a semi-supervised learning
problem on graphs, on which two graph-based methods are applied: label
propagation (LP) and graph neural networks (GNNs). The proposed approaches are
able to utilize the structural information of the graph to improve speaker
attribution performance. Experimental results on real meeting data show that
the graph-based approaches reduce speaker attribution error by up to 68%
compared to a baseline speaker identification approach that processes each
utterance independently.
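To make the method concrete, here is a minimal NumPy sketch of the label propagation (LP) variant: profile segments enter the session graph as labeled nodes, test segments as unlabeled nodes, edge weights come from cosine similarity between pretrained speaker embeddings, and profile labels are propagated with the labeled nodes clamped. The names (profile_emb, segment_emb, k, n_iters) and the k-nearest-neighbor sparsification are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of speaker attribution as graph-based label propagation.
# Assumes embeddings are precomputed by a pretrained speaker-embedding model;
# this is an illustrative reconstruction, not the paper's code.
import numpy as np

def cosine_affinity(X):
    """Pairwise cosine similarity between the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def attribute_speakers(profile_emb, profile_ids, segment_emb,
                       k=10, n_iters=50):
    """profile_emb: (p, d) embeddings of enrolled voice profiles (labeled).
    profile_ids:  (p,) integer speaker id of each profile segment.
    segment_emb:  (m, d) embeddings of the session's test segments (unlabeled).
    Returns one speaker id per test segment.
    """
    p = len(profile_emb)
    n_speakers = int(profile_ids.max()) + 1
    X = np.vstack([profile_emb, segment_emb])

    # Build the session graph: nonnegative similarity edges, no self-loops,
    # sparsified to each node's k strongest neighbors, kept symmetric.
    W = np.clip(cosine_affinity(X), 0.0, None)
    np.fill_diagonal(W, 0.0)
    weakest = np.argsort(W, axis=1)[:, :-k]
    np.put_along_axis(W, weakest, 0.0, axis=1)
    W = np.maximum(W, W.T)

    # Row-normalized transition matrix for propagation.
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)

    # One-hot labels on profile nodes; zeros on test nodes.
    Y = np.zeros((len(X), n_speakers))
    Y[np.arange(p), profile_ids] = 1.0

    F = Y.copy()
    for _ in range(n_iters):
        F = P @ F          # spread label mass along weighted edges
        F[:p] = Y[:p]      # clamp the labeled profile nodes
    return F[p:].argmax(axis=1)
```

The GNN variant named in the abstract would, roughly speaking, replace the fixed propagation operator P with learned message-passing layers trained on sessions with known speaker labels.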
Related papers
- Online speaker diarization of meetings guided by speech separation [0.0]
Overlapped speech is notoriously problematic for speaker diarization systems.
We introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings.
arXiv Detail & Related papers (2024-01-30T09:09:22Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, yielding average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin), a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data [24.608764078208953]
Subgraphs are constructed by sampling the entire pool of available training data to exploit the relationship between labeled and unlabeled audio samples.
We evaluate our model on three benchmark audio databases and two tasks: acoustic event detection and speech emotion recognition.
Our model is compact (240k parameters) and can produce generalized audio representations that are robust to different types of signal noise.
arXiv Detail & Related papers (2022-01-31T21:32:22Z)
- Learning Spatial-Temporal Graphs for Active Speaker Detection [26.45877018368872]
SPELL is a framework that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data.
We first construct a graph from a video so that each node corresponds to one person.
We demonstrate that learning a graph-based representation, owing to its explicit spatial and temporal structure, significantly improves overall performance.
arXiv Detail & Related papers (2021-12-02T18:29:07Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another without relying on intermediate text generation.
We propose instead to predict self-supervised discrete representations learned from an unlabeled speech corpus.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual-mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Graph-based Label Propagation for Semi-Supervised Speaker Identification [10.87690067963342]
We propose a graph-based semi-supervised learning approach for speaker identification in the household scenario.
We show that this approach makes effective use of unlabeled data and improves speaker identification accuracy compared to two state-of-the-art scoring methods.
arXiv Detail & Related papers (2021-06-15T15:10:33Z)
- Graph Attention Networks for Speaker Verification [43.01058120303278]
This work presents a novel back-end framework for speaker verification using graph attention networks.
We first construct a graph using segment-wise speaker embeddings and then input these to graph attention networks.
After a few graph attention layers with residual connections, each node is projected into a one-dimensional space using an affine transform (see the sketch after this list).
arXiv Detail & Related papers (2020-10-22T09:08:02Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
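The Graph Attention Networks entry above describes its back-end step by step, so a small sketch may help: the PyTorch code below builds a fully connected graph over segment-wise embeddings, applies a few single-head attention layers with residual connections, and projects each node to one dimension with an affine readout. The layer count, dimensions, dense graph, and all names (GraphAttentionLayer, GATBackend) are assumptions for illustration, not that paper's exact architecture.

```python
# Hypothetical sketch of a GAT-style verification back-end (single attention
# head, fully connected segment graph); names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, h):                          # h: (n_nodes, dim)
        z = self.proj(h)
        n = z.size(0)
        # Score every ordered node pair from its concatenated features.
        pairs = torch.cat([z.repeat_interleave(n, dim=0),
                           z.repeat(n, 1)], dim=1)
        e = F.leaky_relu(self.attn(pairs), 0.2).view(n, n)
        a = torch.softmax(e, dim=1)                # attention over neighbors
        return h + a @ z                           # residual connection

class GATBackend(nn.Module):
    def __init__(self, dim, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(GraphAttentionLayer(dim)
                                    for _ in range(n_layers))
        self.readout = nn.Linear(dim, 1)           # affine projection to 1-D

    def forward(self, segment_emb):                # (n_segments, dim)
        h = segment_emb
        for layer in self.layers:
            h = layer(h)
        return self.readout(h).squeeze(-1)         # one scalar per node

# Example: score 10 segments with 256-dimensional embeddings.
# scores = GATBackend(256)(torch.randn(10, 256))
```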
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.