SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors
- URL: http://arxiv.org/abs/2503.16578v1
- Date: Thu, 20 Mar 2025 11:31:47 GMT
- Title: SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors
- Authors: Yang Chen, Hui Wang, Shiyao Wang, Junyang Chen, Jiabei He, Jiaming Zhou, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin
- Abstract summary: SeniorTalk is a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants. We perform experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks.
- Score: 23.837811649327094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While voice technologies increasingly serve aging populations, current systems exhibit significant performance gaps due to inadequate training data capturing elderly-specific vocal characteristics like presbyphonia and dialectal variations. The limited data available on super-aged individuals in existing elderly speech datasets, coupled with overly simple recording styles and annotation dimensions, exacerbates this issue. To address the critical scarcity of speech data from individuals aged 75 and above, we introduce SeniorTalk, a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants, ensuring a strategic balance across gender, region, and age. Through detailed annotation across multiple dimensions, it can support a wide range of speech tasks. We perform extensive experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks, offering crucial insights for the development of speech technologies targeting this age group.
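The speaker verification experiments mentioned in the abstract are conventionally scored with the equal error rate (EER): the operating point where the false-acceptance and false-rejection rates coincide. As a minimal illustration of how that metric is computed from trial scores (the scores and labels below are made up for demonstration and are not drawn from SeniorTalk), a threshold-sweep implementation looks like this:

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping each observed score as a
    decision threshold and returning the mean of FAR and FRR at
    the threshold where they are closest.

    scores: per-trial similarity scores (higher = more likely same speaker)
    labels: 1 for a target (same-speaker) trial, 0 for a non-target trial
    """
    n_target = labels.count(1)
    n_nontarget = labels.count(0)
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        # False acceptance: non-target trials accepted at this threshold
        far = sum(s >= t and l == 0 for s, l in zip(scores, labels)) / n_nontarget
        # False rejection: target trials rejected at this threshold
        frr = sum(s < t and l == 1 for s, l in zip(scores, labels)) / n_target
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy trial list: one target trial scores low (0.4) and one non-target
# trial scores high (0.6), so the system cannot separate them perfectly.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(equal_error_rate(scores, labels))  # → 0.25
```

With a finite trial list the FAR and FRR curves are step functions and rarely cross exactly, so this sketch reports the midpoint at the nearest crossing; evaluation toolkits typically interpolate instead, but the idea is the same.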
Related papers
- SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description [19.064845530513285]
We propose an automatic speech annotation system that annotates in-the-wild speech clips with expressive and vivid natural language descriptions.
Our system provides in-depth understandings of speech style through tailored natural language descriptions.
The resulting dataset is distinguished by highly descriptive natural language style prompts, containing approximately 2,000 hours of audio data and encompassing over two million speech clips.
arXiv Detail & Related papers (2024-08-24T15:36:08Z) - Investigating the Effects of Large-Scale Pseudo-Stereo Data and Different Speech Foundation Model on Dialogue Generative Spoken Language Model [47.67067056593085]
We develop a pipeline capable of transforming single-channel dialogue data into pseudo-stereo data.
This expanded our training dataset from a mere 2,000 to an impressive 17,600 hours.
The inclusion of this pseudo-stereo data has proven to be effective in improving the performance of spoken dialogue language models.
arXiv Detail & Related papers (2024-07-02T03:22:41Z) - EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
arXiv Detail & Related papers (2024-06-10T11:28:29Z) - Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z) - SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z) - Leveraging Speaker Embeddings with Adversarial Multi-task Learning for Age Group Classification [0.0]
We consider the use of speaker-discriminative embeddings derived from adversarial multi-task learning to align features and reduce the domain discrepancy in age subgroups.
Experimental results on the VoxCeleb Enrichment dataset verify the effectiveness of our proposed adaptive adversarial network in multi-objective scenarios.
arXiv Detail & Related papers (2023-01-22T05:01:13Z) - Sentiment recognition of Italian elderly through domain adaptation on cross-corpus speech dataset [77.99182201815763]
The aim of this work is to define a speech emotion recognition (SER) model able to recognize positive, neutral, and negative emotions in natural conversations of Italian elderly people.
arXiv Detail & Related papers (2022-11-14T12:39:41Z) - Data-augmented cross-lingual synthesis in a teacher-student framework [3.2548794659022398]
Cross-lingual synthesis is the task of letting a speaker generate fluent synthetic speech in another language.
Previous research shows that many models appear to have insufficient generalization capabilities.
We propose to apply the teacher-student paradigm to cross-lingual synthesis.
arXiv Detail & Related papers (2022-03-31T20:01:32Z) - A Review of Speaker Diarization: Recent Advances with Deep Learning [78.20151731627958]
Speaker diarization is a task to label audio or video recordings with classes corresponding to speaker identity.
With the rise of deep learning technology, more rapid advancements have been made for speaker diarization.
We discuss how speaker diarization systems have been integrated with speech recognition applications.
arXiv Detail & Related papers (2021-01-24T01:28:05Z) - Speaker De-identification System using Autoencoders and Adversarial Training [58.720142291102135]
We propose a speaker de-identification system based on adversarial training and autoencoders.
Experimental results show that combining adversarial learning and autoencoders increases the equal error rate of a speaker verification system.
arXiv Detail & Related papers (2020-11-09T19:22:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.