Towards an Interpretable Representation of Speaker Identity via
Perceptual Voice Qualities
- URL: http://arxiv.org/abs/2310.02497v1
- Date: Wed, 4 Oct 2023 00:06:17 GMT
- Title: Towards an Interpretable Representation of Speaker Identity via
Perceptual Voice Qualities
- Authors: Robin Netzorg, Bohan Yu, Andrea Guzman, Peter Wu, Luna McNulty, Gopala
Anumanchipalli
- Abstract summary: We propose a possible interpretable representation of speaker identity based on perceptual voice qualities (PQs).
Contrary to prior belief, we demonstrate that these PQs are hearable by ensembles of non-experts.
- Score: 4.95865031722089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unlike other data modalities such as text and vision, speech does not lend
itself to easy interpretation. While lay people can understand how to describe
an image or sentence via perception, non-expert descriptions of speech often
end at high-level demographic information, such as gender or age. In this
paper, we propose a possible interpretable representation of speaker identity
based on perceptual voice qualities (PQs). By adding gendered PQs to the
pathology-focused Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V)
protocol, our PQ-based approach provides a perceptual latent space of the
character of adult voices that is an intermediary of abstraction between
high-level demographics and low-level acoustic, physical, or learned
representations. Contrary to prior belief, we demonstrate that these PQs are
hearable by ensembles of non-experts, and further demonstrate that the
information encoded in a PQ-based representation is predictable by various
speech representations.
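As a rough illustration of the abstract's two claims, the sketch below (hypothetical data, names, and shapes; not the paper's code) averages per-rater PQ scores from an ensemble of non-experts into a consensus rating per speaker, then probes how predictable each PQ is from a generic speech embedding with a cross-validated ridge regression.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins (not from the paper):
#   ratings:    (n_speakers, n_raters, n_pqs) per-rater PQ scores
#   embeddings: (n_speakers, dim) any speech representation, acoustic or learned
rng = np.random.default_rng(0)
ratings = rng.uniform(0, 100, size=(200, 10, 7))
embeddings = rng.normal(size=(200, 256))

# Ensemble the non-expert raters: averaging over raters yields one
# consensus PQ vector per speaker, a point in the perceptual latent space.
consensus_pqs = ratings.mean(axis=1)          # (n_speakers, n_pqs)

# Probe each PQ dimension with a simple ridge regression from the speech
# representation; a high cross-validated R^2 means the PQ is predictable.
for pq_idx in range(consensus_pqs.shape[1]):
    r2 = cross_val_score(Ridge(alpha=1.0), embeddings,
                         consensus_pqs[:, pq_idx], cv=5, scoring="r2")
    print(f"PQ {pq_idx}: mean CV R^2 = {r2.mean():.3f}")
```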
Related papers
- Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology [1.7126708168238125]
Trans-feminine gender-affirming voice teachers have unique perspectives on voice that confound current understandings of speaker identity.
We present the Versatile Voice dataset (VVD), a collection of three speakers modifying their voices along gendered axes.
arXiv Detail & Related papers (2024-07-09T21:19:49Z)
- Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction [23.115506530649988]
PerceptiveAgent is an empathetic multi-modal dialogue system designed to discern deeper or more subtle meanings.
PerceptiveAgent perceives acoustic information from input speech and generates empathetic responses based on speaking styles described in natural language.
arXiv Detail & Related papers (2024-06-18T15:19:51Z)
- Evaluating Speaker Identity Coding in Self-supervised Models and Humans [0.42303492200814446]
Speaker identity plays a significant role in human communication and is being increasingly used in societal applications.
We show that self-supervised representations from different families are significantly better for speaker identification than acoustic representations.
We also show that such a speaker identification task can be used to better understand the nature of acoustic information representation in different layers of these powerful networks.
arXiv Detail & Related papers (2024-06-14T20:07:21Z)
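A hedged sketch of the layer-wise probing idea described in the entry above (toy random features stand in for real model activations; all names and sizes are assumptions): a simple classifier is trained on each layer's features, and the layers are compared by speaker-identification accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_utts, n_speakers, dim, n_layers = 300, 20, 128, 6
speaker_ids = rng.integers(0, n_speakers, size=n_utts)
# Hypothetical stand-in for per-layer features from a self-supervised model.
layer_features = [rng.normal(size=(n_utts, dim)) for _ in range(n_layers)]

# A linear probe per layer: layers whose features separate speakers well
# get high accuracy, localizing where identity information is encoded.
for layer, feats in enumerate(layer_features):
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          feats, speaker_ids, cv=3).mean()
    print(f"layer {layer}: speaker-ID accuracy = {acc:.3f}")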
- Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation [50.35367785674921]
Listener head generation centers on generating non-verbal behaviors of a listener in reference to the information delivered by a speaker.
A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation.
We propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords.
Our ELP model can not only automatically generate natural and diverse responses toward a given speaker via sampling from the learned distribution but also generate controllable responses with a predetermined attitude.
arXiv Detail & Related papers (2023-09-29T18:18:32Z)
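The motion-codeword idea in the entry above can be pictured with the toy sketch below (hypothetical codebook and sizes; the actual ELP model learns both the codebook and the sampling distribution from data): each facial motion is composed from a few discrete codewords, and repeated sampling yields diverse responses.

```python
import numpy as np

rng = np.random.default_rng(0)
n_codewords, motion_dim = 32, 15                       # hypothetical sizes
codebook = rng.normal(size=(n_codewords, motion_dim))  # discrete motion-codewords

# In the real model a network conditioned on the speaker's signal predicts
# this distribution over codewords; random logits stand in for it here.
logits = rng.normal(size=n_codewords)
probs = np.exp(logits) / np.exp(logits).sum()

# Each draw composes a fine-grained facial motion from several sampled
# codewords, so repeated sampling gives diverse but plausible responses.
for _ in range(3):
    chosen = rng.choice(n_codewords, size=4, replace=False, p=probs)
    motion = codebook[chosen].sum(axis=0)
    print(chosen, np.round(motion[:4], 2))
```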
- Residual Information in Deep Speaker Embedding Architectures [4.619541348328938]
This paper introduces an analysis over six sets of speaker embeddings extracted with some of the most recent and high-performing DNN architectures.
The dataset includes 46 speakers uttering the same set of prompts, recorded in either a professional studio or their home environments.
The results show that the discriminative power of the analyzed embeddings is very high, yet across all the analyzed architectures, residual information is still present in the representations.
arXiv Detail & Related papers (2023-02-06T12:37:57Z)
- Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural network-based visual lip-reading models.
We observe a strong correlation between these theories in cognitive psychology and our unique modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
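A minimal sketch of the vector-quantization step the entry above names (PyTorch; sizes are hypothetical, and VQMIVC's mutual-information losses between content, speaker, and pitch representations are omitted): each content frame is snapped to its nearest codebook entry, with a straight-through estimator to keep the encoder trainable.

```python
import torch

torch.manual_seed(0)
codebook = torch.randn(64, 32)            # 64 codes, 32-dim content space

def vector_quantize(z):
    """Map each content frame to its nearest codebook entry.

    The straight-through trick keeps gradients flowing to the encoder
    even though the nearest-neighbour lookup is non-differentiable.
    """
    dists = torch.cdist(z, codebook)      # (frames, codes)
    idx = dists.argmin(dim=1)
    z_q = codebook[idx]
    return z + (z_q - z).detach(), idx    # straight-through estimator

z = torch.randn(100, 32, requires_grad=True)   # hypothetical encoder output
z_q, idx = vector_quantize(z)
z_q.sum().backward()                      # gradients reach z despite the lookup
print(idx[:10])
```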
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
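The pseudo-label strategy in the entry above can be sketched as a generic teacher-student loop (toy features and an assumed confidence threshold; not the paper's pipeline): a teacher trained on labeled data labels the unlabeled pool, and only confident pseudo-labels are added back for retraining.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical features: a small labeled sentiment set and a larger
# unlabeled speech-side set (stand-ins for the paper's real data).
X_lab, y_lab = rng.normal(size=(100, 16)), rng.integers(0, 2, size=100)
X_unlab = rng.normal(size=(500, 16))

# Teacher trained on labeled data assigns pseudo labels to unlabeled data.
teacher = LogisticRegression().fit(X_lab, y_lab)
probs = teacher.predict_proba(X_unlab)
keep = probs.max(axis=1) > 0.8            # confidence threshold (assumed)

# Student retrains on labeled data plus the confident pseudo-labeled part.
X_aug = np.vstack([X_lab, X_unlab[keep]])
y_aug = np.concatenate([y_lab, probs.argmax(axis=1)[keep]])
student = LogisticRegression().fit(X_aug, y_aug)
print(f"kept {keep.sum()} of {len(X_unlab)} pseudo-labeled examples")
```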
- Protecting gender and identity with disentangled speech representations [49.00162808063399]
We show that protecting gender information in speech is more effective than modelling speaker-identity information.
We present a novel way to encode gender information and disentangle two sensitive biometric identifiers.
arXiv Detail & Related papers (2021-04-22T13:31:41Z)
- Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation [17.344080729609026]
We introduce the concept of attribute-driven privacy preservation in speaker voice representation.
It allows a person to hide one or more personal aspects from a potential malicious interceptor and from the application provider.
We propose an adversarial autoencoding method that disentangles in the voice representation a given speaker attribute thus allowing its concealment.
arXiv Detail & Related papers (2020-12-08T14:47:23Z)
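A rough sketch of the adversarial disentanglement idea above, using gradient reversal as one common realization (the paper describes an adversarial autoencoder; sizes and architecture here are hypothetical): an adversary tries to predict the protected attribute from the representation, and the reversed gradient trains the encoder to conceal it.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, g):
        return -g

torch.manual_seed(0)
encoder = nn.Linear(40, 16)               # hypothetical sizes
adversary = nn.Linear(16, 2)              # predicts the protected attribute
opt = torch.optim.Adam(list(encoder.parameters()) + list(adversary.parameters()))

x = torch.randn(32, 40)                   # toy input features
attr = torch.randint(0, 2, (32,))         # protected attribute labels

for _ in range(5):
    z = encoder(x)
    # The adversary learns to read the attribute; the reversed gradient
    # simultaneously pushes the encoder to scrub that information from z.
    loss = nn.functional.cross_entropy(adversary(GradReverse.apply(z)), attr)
    opt.zero_grad()
    loss.backward()
    opt.step()
```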
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
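One way to picture the fusion the entry above describes (a toy sketch with an assumed attenuation rule; not the paper's system): word-level speaker-turn probabilities down-weight the embedding affinity across likely turn boundaries before clustering.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
n_segs = 12
emb = rng.normal(size=(n_segs, 8))                 # per-segment speaker embeddings
turn_prob = rng.uniform(size=n_segs - 1)           # P(speaker change) between segments

# Cosine affinity from embeddings, attenuated across likely turn boundaries.
norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
affinity = norm @ norm.T
for i in range(n_segs - 1):
    affinity[i, i + 1] *= (1 - turn_prob[i])       # hypothetical fusion rule
    affinity[i + 1, i] = affinity[i, i + 1]

# Cluster on distance = 1 - affinity (sklearn >= 1.2 uses `metric=`;
# older versions call the same parameter `affinity=`).
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(1 - affinity)
print(labels)
```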
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.