Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions
- URL: http://arxiv.org/abs/2406.07890v1
- Date: Wed, 12 Jun 2024 05:41:01 GMT
- Title: Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions
- Authors: Anfeng Xu, Kevin Huang, Tiantian Feng, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan
- Abstract summary: We show that exemplary speech foundation models can achieve 39.5% and 62.3% relative reductions in Diarization Error Rate and Speaker Confusion Rate, respectively, compared to previous speaker diarization methods.
Our results highlight promising pathways for understanding and adopting speech foundation models to facilitate child speech understanding.
- Score: 28.5211771482547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech foundation models, trained on vast datasets, have opened unique opportunities in addressing challenging low-resource speech understanding, such as child speech. In this work, we explore the capabilities of speech foundation models on child-adult speaker diarization. We show that exemplary foundation models can achieve 39.5% and 62.3% relative reductions in Diarization Error Rate and Speaker Confusion Rate, respectively, compared to previous speaker diarization methods. In addition, we benchmark and evaluate the speaker diarization results of the speech foundation models with varying the input audio window size, speaker demographics, and training data ratio. Our results highlight promising pathways for understanding and adopting speech foundation models to facilitate child speech understanding.
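The two reported metrics are straightforward ratios: Diarization Error Rate (DER) is conventionally the sum of missed speech, false alarm, and speaker confusion time over total reference speech time, and a relative reduction compares two such rates. A minimal stdlib-only sketch, using hypothetical durations (not figures from the paper):

```python
# Minimal sketch of Diarization Error Rate (DER) and relative reduction.
# All durations are in seconds; the numbers below are illustrative only.

def der(missed: float, false_alarm: float, confusion: float, total_speech: float) -> float:
    """DER = (missed + false alarm + speaker confusion) / total reference speech."""
    return (missed + false_alarm + confusion) / total_speech

def relative_reduction(baseline: float, new: float) -> float:
    """Relative reduction of an error metric; 0.395 means a 39.5% reduction."""
    return (baseline - new) / baseline

# Hypothetical baseline vs. foundation-model system on a 1000 s session:
baseline_der = der(missed=60, false_alarm=40, confusion=100, total_speech=1000)  # 0.20
new_der = der(missed=50, false_alarm=31, confusion=40, total_speech=1000)        # 0.121
print(f"relative DER reduction: {relative_reduction(baseline_der, new_der):.1%}")
# prints: relative DER reduction: 39.5%
```

The same formula applied to the confusion component alone yields the paper's separately reported Speaker Confusion Rate reduction.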
Related papers
- Multimodal Input Aids a Bayesian Model of Phonetic Learning [0.6827423171182154]
We introduce a method for creating high-quality synthetic videos of speakers' faces for an existing audio corpus.
Our learning model, when both trained and tested on audiovisual inputs, achieves up to a 8.1% relative improvement on a phoneme discrimination battery.
Visual information is especially beneficial in noisy audio environments.
arXiv Detail & Related papers (2024-07-22T19:00:11Z)
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
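One simple way to integrate such pairwise constraints into a clustering-based diarization pipeline is to overwrite entries of the segment-similarity matrix before clustering. The sketch below is an illustrative simplification (plain threshold clustering, hypothetical similarity values), not the framework of the cited paper:

```python
# Illustrative sketch: inject must-link / cannot-link constraints into a
# pairwise similarity matrix, then cluster by thresholded connectivity.

def apply_constraints(sim, must_link, cannot_link):
    """Overwrite pairwise similarities with hard semantic constraints:
    must-link pairs become maximally similar, cannot-link pairs dissimilar."""
    out = [row[:] for row in sim]
    for i, j in must_link:
        out[i][j] = out[j][i] = 1.0
    for i, j in cannot_link:
        out[i][j] = out[j][i] = 0.0
    return out

def cluster(sim, threshold=0.5):
    """Label segments by connected components over edges with sim >= threshold."""
    n = len(sim)
    labels = [-1] * n
    next_label = 0
    for i in range(n):
        if labels[i] == -1:
            labels[i] = next_label
            stack = [i]
            while stack:
                u = stack.pop()
                for v in range(n):
                    if labels[v] == -1 and sim[u][v] >= threshold:
                        labels[v] = labels[u]
                        stack.append(v)
            next_label += 1
    return labels

sim = [[1.0, 0.6, 0.55],
       [0.6, 1.0, 0.4],
       [0.55, 0.4, 1.0]]

# Without constraints all three segments merge; a cannot-link between
# segments 0 and 2 (e.g. conflicting semantic cues) splits them.
print(cluster(sim))                                   # → [0, 0, 0]
print(cluster(apply_constraints(sim, [], [(0, 2)])))  # → [0, 0, 1]
```

Real systems propagate constraints through the affinity matrix rather than hard-overwriting single entries, but the effect on the final clustering is analogous.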
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Improving Children's Speech Recognition by Fine-tuning Self-supervised Adult Speech Representations [2.2191297646252646]
Children's speech recognition is a vital, yet largely overlooked domain when building inclusive speech technologies.
Recent advances in self-supervised learning have created a new opportunity for overcoming this problem of data scarcity.
We leverage self-supervised adult speech representations and use three well-known child speech corpora to build models for children's speech recognition.
arXiv Detail & Related papers (2022-11-14T22:03:36Z)
- A Data-Driven Investigation of Noise-Adaptive Utterance Generation with Linguistic Modification [25.082714256583422]
In noisy environments, speech can be hard to understand for humans.
We create a dataset of 900 paraphrases in babble noise, perceived by native English speakers with normal hearing.
We find that careful selection of paraphrases can improve intelligibility by 33% at SNR -5 dB.
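Constructing a condition like "babble noise at SNR -5 dB" follows directly from the SNR definition: scale the noise so that 20·log10(rms(speech)/rms(noise)) hits the target, then add. A stdlib-only sketch with toy signals (not the dataset's actual stimuli):

```python
import math

def rms(x):
    """Root-mean-square energy of a signal."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 20*log10(rms(speech)/rms(scaled_noise)) == snr_db,
    then add it to the speech sample by sample."""
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]

# Toy sinusoids standing in for speech and babble; at SNR -5 dB the
# noise component ends up louder than the speech component.
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [math.sin(0.37 * i + 1.0) for i in range(1000)]
mixed = mix_at_snr(speech, noise, snr_db=-5)
```

Negative SNR means the noise carries more energy than the speech, which is why intelligibility gains from paraphrase selection are measured at that operating point.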
arXiv Detail & Related papers (2022-10-19T02:20:17Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
Speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
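The core idea of adding an extra SSL loss on intermediate layers reduces, at its simplest, to a weighted sum of per-layer losses in the training objective. A deliberately minimal sketch with hypothetical loss values and weight (the actual ILS-SSL loss placement and weighting are defined in the cited paper):

```python
# Illustrative sketch: combine the final-layer SSL loss with additional
# SSL losses computed on selected intermediate layers.

def total_ssl_loss(layer_losses, intermediate_layers, weight=0.5):
    """layer_losses[i] is the SSL loss computed from layer i's output;
    the last entry is the usual top-layer loss. Supervising intermediate
    layers pushes them toward content information."""
    final = layer_losses[-1]
    extra = sum(layer_losses[i] for i in intermediate_layers)
    return final + weight * extra

# Hypothetical per-layer losses for a 3-layer model, supervising layers 0 and 1:
loss = total_ssl_loss([1.0, 2.0, 3.0], intermediate_layers=[0, 1])  # 3.0 + 0.5 * 3.0
```

The gradient from the extra terms flows directly into the lower layers instead of only through the top of the network.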
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Senone-aware Adversarial Multi-task Training for Unsupervised Child to Adult Speech Adaptation [26.065719754453823]
We propose a feature adaptation approach to minimize acoustic mismatch at the senone (tied triphone states) level between adult and child speech.
We validate the proposed method on three tasks: child speech recognition, child pronunciation assessment, and child fluency score prediction.
arXiv Detail & Related papers (2021-02-23T04:49:27Z)
- Towards Modelling Coherence in Spoken Discourse [48.80477600384429]
Coherence in spoken discourse is dependent on the prosodic and acoustic patterns in speech.
We model coherence in spoken discourse with audio-based coherence models.
arXiv Detail & Related papers (2020-12-31T20:18:29Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
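The contrastive task over masked latents is, in essence, an InfoNCE objective: given a context vector, identify the true quantized latent among distractors. A stdlib-only sketch with hypothetical low-dimensional vectors (real wav2vec 2.0 uses learned quantized codebook entries and a temperature-scaled cosine similarity over high-dimensional representations):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def contrastive_loss(context, positive, distractors, temperature=0.1):
    """InfoNCE-style loss: negative log-probability of picking the true
    latent among distractors, computed with a numerically stable log-sum-exp."""
    logits = [cosine(context, positive) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)

# Hypothetical 3-dim vectors: the positive is nearly aligned with the
# context, the distractors are orthogonal, so the loss should be small.
ctx = [1.0, 0.0, 0.0]
pos = [0.9, 0.1, 0.0]
neg = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
loss = contrastive_loss(ctx, pos, neg)
```

Masking spans of the latent sequence before computing this loss is what forces the model to infer content from surrounding context.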
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- Learning to Understand Child-directed and Adult-directed Speech [18.29692441616062]
Human language acquisition research indicates that child-directed speech helps language learners.
We compare the task performance of models trained on adult-directed speech (ADS) and child-directed speech (CDS).
We find indications that CDS helps in the initial stages of learning, but eventually, models trained on ADS reach comparable task performance, and generalize better.
arXiv Detail & Related papers (2020-05-06T10:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.