VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin
- URL: http://arxiv.org/abs/2505.21445v1
- Date: Tue, 27 May 2025 17:16:59 GMT
- Title: VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin
- Authors: Zhiqi Ai, Meixuan Bao, Zhiyong Chen, Zhi Yang, Xinnuo Li, Shugong Xu
- Abstract summary: We present a large-scale longitudinal dataset collected from 293 speakers over several years, with the longest time span reaching 17 years (approximately 900 weeks). We studied the phenomenon of speaker aging and its effects on advanced speaker verification systems, analyzed individual speaker aging processes, and explored the impact of factors such as age group and gender on speaker aging research.
- Score: 14.375859578488456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of speaker verification systems is adversely affected by speaker aging. However, due to challenges in data collection, particularly the lack of sustained and large-scale longitudinal data for individuals, research on speaker aging remains difficult. In this paper, we present VoxAging, a large-scale longitudinal dataset collected from 293 speakers (226 English speakers and 67 Mandarin speakers) over several years, with the longest time span reaching 17 years (approximately 900 weeks). For each speaker, the data were recorded at weekly intervals. We studied the phenomenon of speaker aging and its effects on advanced speaker verification systems, analyzed individual speaker aging processes, and explored the impact of factors such as age group and gender on speaker aging research.
Related papers
- On Barriers to Archival Audio Processing [16.244692109502726]
We leverage a unique UNESCO collection of mid-20th century radio recordings to probe the robustness of modern off-the-shelf language identification (LID) and speaker recognition (SR) methods. Our findings suggest that LID systems, such as Whisper, are increasingly adept at handling second-language and accented speech. However, speaker embeddings remain a fragile component of speech processing pipelines that is prone to biases related to the channel, age, and language.
arXiv Detail & Related papers (2025-07-11T17:27:11Z)
- Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers [50.9040167152168]
We analyze neurons associated with k-means clusters of self-supervised features and i-vectors. Our analysis reveals that these clusters correspond to broad phonetic and gender classes. By protecting these neurons during pruning, we can significantly preserve performance on speaker-related tasks.
arXiv Detail & Related papers (2025-06-26T18:54:26Z)
- SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors [23.837811649327094]
SeniorTalk is a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants. We perform experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks.
arXiv Detail & Related papers (2025-03-20T11:31:47Z)
- Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization [17.048523623756623]
We investigate the impact of speech temporal dynamics in application to automatic speaker verification and speaker voice anonymization tasks. We propose several metrics to perform automatic speaker verification based only on phoneme durations.
arXiv Detail & Related papers (2024-12-22T21:18:08Z)
- Speaker Verification in Agent-Generated Conversations [47.6291644653831]
The recent success of large language models (LLMs) has attracted widespread interest in developing role-playing conversational agents personalized to the characteristics and styles of different speakers to enhance their abilities to perform both general and special purpose dialogue tasks.
This study introduces a novel evaluation challenge: speaker verification in agent-generated conversations, which aims to verify whether two sets of utterances originate from the same speaker.
arXiv Detail & Related papers (2024-05-16T14:46:18Z)
- Leveraging Speaker Embeddings with Adversarial Multi-task Learning for Age Group Classification [0.0]
We consider the use of speaker-discriminative embeddings derived from adversarial multi-task learning to align features and reduce the domain discrepancy in age subgroups.
Experimental results on the VoxCeleb Enrichment dataset verify the effectiveness of our proposed adaptive adversarial network in multi-objective scenarios.
arXiv Detail & Related papers (2023-01-22T05:01:13Z)
- Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition [48.33873602050463]
Speaker adaptation techniques play a key role in personalization of ASR systems for such users.
Motivated by the spectro-temporal level differences between dysarthric, elderly and normal speech, novel spectro-temporal subspace basis deep embedding features derived using SVD of the speech spectrum are proposed.
arXiv Detail & Related papers (2022-02-21T15:11:36Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- A Longitudinal Multi-modal Dataset for Dementia Monitoring and Diagnosis [22.672055089496972]
We introduce a novel fine-grained longitudinal multi-modal corpus collected from healthy controls and people with dementia.
The corpus consists of spoken conversations, a subset of which are transcribed, as well as typed and written thoughts and associated extra-linguistic information.
arXiv Detail & Related papers (2021-09-03T14:02:12Z)
- A Review of Speaker Diarization: Recent Advances with Deep Learning [78.20151731627958]
Speaker diarization is a task to label audio or video recordings with classes corresponding to speaker identity.
With the rise of deep learning technology, more rapid advancements have been made for speaker diarization.
We discuss how speaker diarization systems have been integrated with speech recognition applications.
arXiv Detail & Related papers (2021-01-24T01:28:05Z)
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)