Self-supervised Predictive Coding Models Encode Speaker and Phonetic
Information in Orthogonal Subspaces
- URL: http://arxiv.org/abs/2305.12464v3
- Date: Mon, 11 Dec 2023 11:36:22 GMT
- Authors: Oli Liu, Hao Tang, Sharon Goldwater
- Abstract summary: Self-supervised speech representations are known to encode both speaker and phonetic information.
We propose a new speaker normalization method which collapses the subspace that encodes speaker information.
- Score: 14.301142521638123
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised speech representations are known to encode both speaker and
phonetic information, but how they are distributed in the high-dimensional
space remains largely unexplored. We hypothesize that they are encoded in
orthogonal subspaces, a property that lends itself to simple disentanglement.
Applying principal component analysis to representations of two predictive
coding models, we identify two subspaces that capture speaker and phonetic
variances, and confirm that they are nearly orthogonal. Based on this property,
we propose a new speaker normalization method which collapses the subspace that
encodes speaker information, without requiring transcriptions. Probing
experiments show that our method effectively eliminates speaker information and
outperforms a previous baseline in phone discrimination tasks. Moreover, the
approach generalizes and can be used to remove information of unseen speakers.
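The idea in the abstract can be illustrated with a minimal sketch on synthetic data. This is not the authors' implementation, and their exact procedure may differ. It assumes that a speaker subspace can be estimated by applying PCA to per-speaker mean vectors, and that "collapsing" means projecting those directions out of every frame. The function names (`group_means`, `principal_directions`, `collapse_subspace`, `mean_spread`), the toy dimensions, and the ring-shaped offsets are all hypothetical choices made for illustration. Phone labels are used only to check orthogonality; the normalization step uses speaker labels alone.

```python
import numpy as np


def group_means(X, labels):
    """Mean vector of each label group, stacked as rows."""
    return np.stack([X[labels == g].mean(axis=0) for g in np.unique(labels)])


def principal_directions(X, k):
    """Top-k principal directions of the rows of X, shape (k, dim)."""
    centered = X - X.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


def collapse_subspace(X, directions):
    """Subtract from each row of X its projection onto span(directions)."""
    return X - (X @ directions.T) @ directions


def mean_spread(X, labels):
    """Average distance of the group means from the overall mean."""
    means = group_means(X, labels)
    return np.linalg.norm(means - X.mean(axis=0), axis=1).mean()


def ring(n, radius=3.0):
    """n points evenly spaced on a circle (toy 2-D offsets, an assumption)."""
    angles = 2 * np.pi * np.arange(n) / n
    return radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)


# Synthetic "representations": speaker and phone offsets are placed in
# orthogonal 2-D subspaces by construction, plus small isotropic noise.
rng = np.random.default_rng(0)
dim, n_spk, n_phone, n_per = 16, 5, 4, 20
basis, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # orthonormal columns
spk_offsets = ring(n_spk) @ basis[:, :2].T        # (n_spk, dim)
phone_offsets = ring(n_phone) @ basis[:, 2:4].T   # (n_phone, dim)

speakers = np.repeat(np.arange(n_spk), n_phone * n_per)
phones = np.tile(np.repeat(np.arange(n_phone), n_per), n_spk)
noise = 0.1 * rng.normal(size=(speakers.size, dim))
X = spk_offsets[speakers] + phone_offsets[phones] + noise

# Estimate each subspace from the corresponding group means.
spk_dirs = principal_directions(group_means(X, speakers), 2)
phone_dirs = principal_directions(group_means(X, phones), 2)

# Largest absolute cosine between a speaker and a phone direction;
# values near zero indicate nearly orthogonal subspaces.
cross = np.abs(spk_dirs @ phone_dirs.T).max()

# Speaker normalization: collapse the estimated speaker subspace.
X_norm = collapse_subspace(X, spk_dirs)

print("max |cos| between subspaces:", cross)
print("speaker spread before/after:",
      mean_spread(X, speakers), mean_spread(X_norm, speakers))
print("phone spread before/after:",
      mean_spread(X, phones), mean_spread(X_norm, phones))
```

In this toy setting the orthogonality is built in, so the sketch only shows the mechanics: when the two subspaces are nearly orthogonal, removing the speaker directions leaves the phone-related variation largely intact. On real model representations, the number of directions to remove would have to be chosen empirically.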
Related papers
- Improving Speaker Diarization using Semantic Information: Joint Pairwise
Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Exploring Speaker-Related Information in Spoken Language Understanding
for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings.
Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z)
- Self-supervised Fine-tuning for Improved Content Representations by
Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of
Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Joint speaker diarisation and tracking in switching state-space model [51.58295550366401]
This paper proposes to explicitly track the movements of speakers while jointly performing diarisation within a unified model.
A state-space model is proposed, where the hidden state expresses the identity of the current active speaker and the predicted locations of all speakers.
Experiments on a Microsoft rich meeting transcription task show that the proposed joint location tracking and diarisation approach is able to perform comparably with other methods that use location information.
arXiv Detail & Related papers (2021-09-23T04:43:58Z)
- Voice Conversion Based Speaker Normalization for Acoustic Unit Discovery [3.128267020893596]
We propose an unsupervised speaker normalization technique prior to unit discovery.
It is based on separating speaker-related from content-induced variations in a speech signal with an adversarial contrastive predictive coding approach.
Experiments on English, Yoruba and Mboshi show improvements compared to using non-normalized input.
arXiv Detail & Related papers (2021-05-04T22:40:41Z)
- Leveraging speaker attribute information using multi task learning for
speaker verification and diarization [33.60058873783114]
We propose a framework for making use of auxiliary label information, even when it is only available for speech corpora mismatched to the target application.
We show that by leveraging two additional forms of speaker attribute information, we improve the performance of our deep speaker embeddings for both verification and diarization tasks.
arXiv Detail & Related papers (2020-10-27T13:10:51Z)
- Speaker diarization with session-level speaker embedding refinement
using graph neural networks [26.688724154619504]
We present the first use of graph neural networks (GNNs) for the speaker diarization problem, utilizing a GNN to refine speaker embeddings locally.
The speaker embeddings extracted by a pre-trained model are remapped into a new embedding space, in which the different speakers within a single session are better separated.
We show that the clustering performance of the refined speaker embeddings outperforms the original embeddings significantly on both simulated and real meeting data.
arXiv Detail & Related papers (2020-05-22T19:52:51Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)