Residual Information in Deep Speaker Embedding Architectures
- URL: http://arxiv.org/abs/2302.02742v1
- Date: Mon, 6 Feb 2023 12:37:57 GMT
- Title: Residual Information in Deep Speaker Embedding Architectures
- Authors: Adriana Stan
- Abstract summary: This paper introduces an analysis of six sets of speaker embeddings extracted with some of the most recent and high-performing DNN architectures.
The dataset includes 46 speakers uttering the same set of prompts, recorded in either a professional studio or their home environments.
The results show that the discriminative power of the analyzed embeddings is very high, yet across all the analyzed architectures, residual information is still present in the representations.
- Score: 4.619541348328938
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speaker embeddings represent a means to extract representative vectorial
representations from a speech signal such that the representation pertains to
the speaker identity alone. The embeddings are commonly used to classify and
discriminate between different speakers. However, there is no objective measure
to evaluate the ability of a speaker embedding to disentangle the speaker
identity from the other speech characteristics. This means that the embeddings
are far from ideal, highly dependent on the training corpus and still include a
degree of residual information pertaining to factors such as linguistic
content, recording conditions or speaking style of the utterance. This paper
introduces an analysis of six sets of speaker embeddings extracted with some
of the most recent and high-performing DNN architectures, and in particular,
the degree to which they are able to truly disentangle the speaker identity
from the speech signal. To correctly evaluate the architectures, a large
multi-speaker parallel speech dataset is used. The dataset includes 46 speakers
uttering the same set of prompts, recorded in either a professional studio or
their home environments. The analysis looks into the intra- and inter-speaker
similarity measures computed over the different embedding sets, as well as
whether simple classification and regression methods are able to extract several
residual information factors from the speaker embeddings. The results show that
the discriminative power of the analyzed embeddings is very high, yet across
all the analyzed architectures, residual information is still present in the
representations in the form of a high correlation to the recording conditions,
linguistic contents and utterance duration.
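The analysis described above combines two ingredients: intra- versus inter-speaker similarity over the embedding sets, and simple probing classifiers that try to recover residual factors such as the recording condition. A minimal sketch of both steps is below; the embedding dimensions, speaker count, and recording-condition labels are synthetic placeholders, and the logistic-regression probe is one plausible choice rather than the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 200 utterances, 46 speakers, 192-dim embeddings,
# and a binary recording-condition label (studio vs. home).
emb = rng.normal(size=(200, 192))
speaker = rng.integers(0, 46, size=200)
condition = rng.integers(0, 2, size=200)

# L2-normalise so dot products are cosine similarities.
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T

# Intra- vs. inter-speaker similarity: a large gap indicates
# strong speaker-discriminative power.
same = speaker[:, None] == speaker[None, :]
mask = ~np.eye(len(emb), dtype=bool)  # exclude self-similarity
intra = sim[same & mask].mean()
inter = sim[~same].mean()
print(f"intra={intra:.3f}  inter={inter:.3f}")

# Probe for residual information: if a simple classifier predicts the
# recording condition well above chance, that factor leaks into the embedding.
acc = cross_val_score(LogisticRegression(max_iter=1000),
                      emb, condition, cv=5).mean()
print(f"condition probe accuracy: {acc:.3f}")
```

With real embeddings, a high probe accuracy (or a strong regression fit for continuous factors like duration) is the signal of residual information that the paper reports.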
Related papers
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can be portrayed well by a few of his/her facial images, or even a single image, using shallow networks.
Fine-grained dynamic features associated with the speech content expressed by the talking face always need deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- An analysis on the effects of speaker embedding choice in non auto-regressive TTS [4.619541348328938]
We introduce a first attempt on understanding how a non-autoregressive factorised multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets.
We show that, regardless of the embedding set and learning strategy used, the network can handle various speaker identities equally well.
arXiv Detail & Related papers (2023-07-19T10:57:54Z) - Quantitative Evidence on Overlooked Aspects of Enrollment Speaker
Embeddings for Target Speaker Separation [14.013049471563141]
Single channel target speaker separation aims at extracting a speaker's voice from a mixture of multiple talkers given an enrollment utterance of that speaker.
A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on the embeddings.
arXiv Detail & Related papers (2022-10-23T07:08:46Z) - Content-Aware Speaker Embeddings for Speaker Diarisation [3.6398652091809987]
The content-aware speaker embeddings (CASE) approach is proposed.
CASE factorises automatic speech recognition (ASR) from speaker recognition to focus on modelling speaker characteristics.
CASE achieved a 17.8% relative speaker error rate reduction over conventional methods.
arXiv Detail & Related papers (2021-02-12T12:02:03Z) - U-vectors: Generating clusterable speaker embedding from unlabeled data [0.0]
This paper introduces a speaker recognition strategy dealing with unlabeled data.
It generates clusterable embedding vectors from small fixed-size speech frames.
We conclude that the proposed approach achieves remarkable performance using pairwise architectures.
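A minimal way to quantify whether a set of embeddings is "clusterable", as the summary above claims, is to cluster them and score agreement with the true speaker labels. The sketch below uses synthetic per-speaker Gaussian blobs and the adjusted Rand index as an illustrative metric; neither is the paper's actual data or evaluation protocol.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n_speakers, per_spk, dim = 5, 40, 32

# Synthetic embeddings: one well-separated Gaussian blob per speaker.
centers = rng.normal(scale=5.0, size=(n_speakers, dim))
emb = np.vstack([rng.normal(loc=c, size=(per_spk, dim)) for c in centers])
labels = np.repeat(np.arange(n_speakers), per_spk)

# Cluster without labels, then compare the partition to ground truth.
pred = KMeans(n_clusters=n_speakers, n_init=10,
              random_state=0).fit_predict(emb)
ari = adjusted_rand_score(labels, pred)
print(f"adjusted Rand index: {ari:.3f}")
```

An adjusted Rand index near 1.0 indicates that the unsupervised clusters recover the speaker partition, which is the practical sense of "clusterable" here.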
arXiv Detail & Related papers (2021-02-07T18:00:09Z) - Leveraging speaker attribute information using multi task learning for
speaker verification and diarization [33.60058873783114]
We propose a framework for making use of auxiliary label information, even when it is only available for speech corpora mismatched to the target application.
We show that by leveraging two additional forms of speaker attribute information, we improve the performance of our deep speaker embeddings for both verification and diarization tasks.
arXiv Detail & Related papers (2020-10-27T13:10:51Z) - Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
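One plausible way to combine word-level speaker-turn probabilities with embedding-based clustering, in the spirit of the summary above, is to down-weight the acoustic affinity between adjacent segments when the lexical cue suggests a speaker change. The sketch below is our own illustration with synthetic inputs; the blending weight and the spectral-clustering back end are assumptions, not the paper's method.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)
n_seg, dim = 60, 16

# Hypothetical inputs: segment embeddings and, for each adjacent pair,
# a lexically derived probability that the speaker changed.
emb = rng.normal(size=(n_seg, dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
turn_prob = rng.uniform(size=n_seg - 1)  # P(change) between segments i, i+1

# Acoustic affinity from cosine similarity, shifted into [0, 1].
affinity = (emb @ emb.T + 1.0) / 2.0

# Blend in the lexical cue: a high change probability lowers the affinity
# between neighbouring segments (the weight 0.5 is an assumption).
w = 0.5
for i in range(n_seg - 1):
    affinity[i, i + 1] *= 1.0 - w * turn_prob[i]
    affinity[i + 1, i] = affinity[i, i + 1]

pred = SpectralClustering(n_clusters=2, affinity="precomputed",
                          random_state=0).fit_predict(affinity)
print(pred[:10])
```

The key design point is that the lexical information enters the pipeline as a soft constraint on the affinity matrix, so the clustering step itself stays unchanged.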
arXiv Detail & Related papers (2020-04-13T17:16:56Z) - Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z) - Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z) - Improving speaker discrimination of target speech extraction with
time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.