Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology
- URL: http://arxiv.org/abs/2407.07235v1
- Date: Tue, 9 Jul 2024 21:19:49 GMT
- Title: Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology
- Authors: Robin Netzorg, Alyssa Cote, Sumi Koshin, Klo Vivienne Garoute, Gopala Krishna Anumanchipalli,
- Abstract summary: Trans-feminine gender-affirming voice teachers have unique perspectives on voice that confound current understandings of speaker identity.
We present the Versatile Voice Dataset (VVD), a collection of three speakers modifying their voices along gendered axes.
- Score: 1.7126708168238125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As experts in voice modification, trans-feminine gender-affirming voice teachers have unique perspectives on voice that confound current understandings of speaker identity. To demonstrate this, we present the Versatile Voice Dataset (VVD), a collection of three speakers modifying their voices along gendered axes. The VVD illustrates that current approaches in speaker modeling, based on categorical notions of gender and a static understanding of vocal texture, fail to account for the flexibility of the vocal tract. Utilizing publicly-available speaker embeddings, we demonstrate that gender classification systems are highly sensitive to voice modification, and speaker verification systems fail to identify voices as coming from the same speaker as voice modification becomes more drastic. As one path towards moving beyond categorical and static notions of speaker identity, we propose modeling individual qualities of vocal texture such as pitch, resonance, and weight.
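The verification failure the abstract describes can be illustrated with a minimal sketch. The embedding values, threshold, and function names below are hypothetical stand-ins, not the paper's actual setup: real systems score pairs of embeddings from a trained speaker encoder, typically by comparing cosine similarity against a tuned threshold, which is what this toy version mimics.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.7):
    # Typical verification rule: accept the pair as the same speaker
    # if similarity exceeds a fixed threshold (0.7 is illustrative).
    return cosine_similarity(emb_a, emb_b) >= threshold

# Toy 3-dim embeddings standing in for a real speaker encoder's outputs
baseline = [0.9, 0.1, 0.2]   # speaker's habitual voice
mild_mod = [0.8, 0.2, 0.3]   # mild voice modification
drastic  = [0.1, 0.9, 0.4]   # drastic modification along gendered axes

print(same_speaker(baseline, mild_mod))  # True: still verified as same speaker
print(same_speaker(baseline, drastic))   # False: verification breaks down
```

Under this rule, a sufficiently drastic modification pushes the embedding past the threshold and the system no longer attributes the voice to the same person, mirroring the failure mode the VVD is designed to expose.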
Related papers
- Voice Passing : a Non-Binary Voice Gender Prediction System for evaluating Transgender voice transition [0.7915536524413253]
This paper presents software that describes voices using a continuous Voice Femininity Percentage (VFP).
It is intended for transgender speakers during their voice transition and for the voice therapists supporting them in this process.
arXiv Detail & Related papers (2024-04-23T16:15:39Z) - Creating New Voices using Normalizing Flows [16.747198180269127]
We investigate the ability of normalizing flows in text-to-speech (TTS) and voice conversion (VC) modes to extrapolate from speakers observed during training to create unseen speaker identities.
We use both objective and subjective metrics to benchmark our techniques on 2 evaluation tasks: zero-shot and new voice speech synthesis.
arXiv Detail & Related papers (2023-12-22T10:00:24Z) - How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation [21.125217707038356]
When translating from notional gender languages into grammatical gender languages, the generated translation requires explicit gender assignments for various words, including those referring to the speaker.
To avoid biased and non-inclusive behaviors, the gender assignment of speaker-related expressions should be guided by externally provided metadata about the speaker's gender.
This paper aims to achieve the same results by integrating the speaker's gender metadata into a single "multi-gender" neural ST model, which is easier to maintain.
arXiv Detail & Related papers (2023-10-23T17:21:32Z) - Towards an Interpretable Representation of Speaker Identity via Perceptual Voice Qualities [4.95865031722089]
We propose a possible interpretable representation of speaker identity based on perceptual voice qualities (PQs)
Contrary to prior belief, we demonstrate that these PQs are hearable by ensembles of non-experts.
arXiv Detail & Related papers (2023-10-04T00:06:17Z) - Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets, yielding 9.56% and 8.24% average reductions in EER and minDCF, respectively.
arXiv Detail & Related papers (2023-10-02T12:02:07Z) - Generating Multilingual Gender-Ambiguous Text-to-Speech Voices [4.005334718121374]
This work addresses the task of generating novel gender-ambiguous TTS voices in a multi-speaker, multilingual setting.
To our knowledge, this is the first systematic and validated approach that can reliably generate a variety of gender-ambiguous voices.
arXiv Detail & Related papers (2022-11-01T10:40:24Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Protecting gender and identity with disentangled speech representations [49.00162808063399]
We show that protecting gender information in speech is more effective than modelling speaker-identity information.
We present a novel way to encode gender information and disentangle two sensitive biometric identifiers.
arXiv Detail & Related papers (2021-04-22T13:31:41Z) - High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train the model to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z) - VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z) - Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.