Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier
- URL: http://arxiv.org/abs/2506.11074v1
- Date: Wed, 04 Jun 2025 00:09:53 GMT
- Title: Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier
- Authors: Tarek Kunze, Marianne Métais, Hadrien Titeux, Lucas Elbert, Joseph Coffey, Emmanuel Dupoux, Alejandrina Cristia, Marvin Lavechin
- Abstract summary: This paper demonstrates several obstacles blocking progress by summarizing three years' worth of experiments aimed at improving one fundamental task: Voice Type Classification. Our experiments suggest that improvements in representation features, architecture, and parameter search contribute to only marginal gains in performance. More progress is made by focusing on data relevance and quantity, which highlights the importance of collecting data with appropriate permissions to allow sharing.
- Score: 44.40187506078601
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recordings gathered with child-worn devices promised to revolutionize both fundamental and applied speech sciences by allowing the effortless capture of children's naturalistic speech environment and language production. This promise hinges on speech technologies that can transform the sheer mounds of data thus collected into usable information. This paper demonstrates several obstacles blocking progress by summarizing three years' worth of experiments aimed at improving one fundamental task: Voice Type Classification. Our experiments suggest that improvements in representation features, architecture, and parameter search contribute to only marginal gains in performance. More progress is made by focusing on data relevance and quantity, which highlights the importance of collecting data with appropriate permissions to allow sharing.
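As a rough illustration of the task, the sketch below shows a minimal frame-level voice type classifier. The five-way label set (KCHI, OCH, FEM, MAL, SPEECH) follows the open-source voice type classifier this line of work builds on; the feature pipeline, model size, and file name are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch, not the authors' model: frame-level voice type
# classification over log-mel features with a small recurrent network.
import torch
import torch.nn as nn
import torchaudio

# Label set of the open-source voice type classifier: key child, other
# children, female adult, male adult, and a catch-all speech class.
VOICE_TYPES = ["KCHI", "OCH", "FEM", "MAL", "SPEECH"]

class FrameClassifier(nn.Module):
    def __init__(self, n_mels: int = 40, hidden: int = 128):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, len(VOICE_TYPES))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(feats)        # feats: (batch, frames, n_mels)
        return self.head(out)           # per-frame logits per voice type

def logmel(path: str, n_mels: int = 40) -> torch.Tensor:
    wav, sr = torchaudio.load(path)     # assumes mono audio
    mel = torchaudio.transforms.MelSpectrogram(sr, n_mels=n_mels)(wav)
    return torch.log(mel + 1e-6).squeeze(0).T.unsqueeze(0)  # (1, frames, n_mels)

model = FrameClassifier()
# Sigmoid rather than softmax: voice types can overlap in daylong audio.
probs = torch.sigmoid(model(logmel("daylong_recording.wav")))
```

Per-frame sigmoid outputs (rather than a single softmax) reflect that several voice types can be active at once in naturalistic daylong recordings.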
Related papers
- Speech Unlearning [14.755831733659699]
We introduce machine unlearning for speech tasks, a novel and underexplored research problem. It aims to efficiently and effectively remove the influence of specific data from trained speech models without full retraining. It has important applications in privacy preservation, removal of outdated or noisy data, and bias mitigation.
arXiv Detail & Related papers (2025-06-01T06:04:16Z)
- Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space [10.875499903992782]
We conduct a set of experiments around zero-shot learning with synthetic speech data for the specific task of speech commands classification.
Our results on the Google Speech Commands dataset show that a simple ASR-based filtering method can have a big impact in the quality of the generated data.
Despite the good quality of the generated speech data, we also show that synthetic and real speech can still be easily distinguishable when using self-supervised (WavLM) features.
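A minimal sketch of the ASR-based filtering idea: keep a synthetic clip only if an ASR system recovers the intended command. Whisper is used here purely as a stand-in recognizer, and `synthetic_clips` is a hypothetical list of TTS outputs; the paper's actual filtering setup may differ.

```python
# ASR-based filtering sketch: a synthetic clip survives only if the
# recognizer transcribes it back to the intended command word.
import whisper

asr = whisper.load_model("base")

def keep_clip(wav_path: str, target_command: str) -> bool:
    """Keep a synthetic clip only if ASR recovers the intended command."""
    text = asr.transcribe(wav_path)["text"].strip().lower().rstrip(".")
    return text == target_command.lower()

# Hypothetical (path, command) pairs produced by some TTS system.
synthetic_clips = [("clip_0001.wav", "yes"), ("clip_0002.wav", "stop")]
filtered = [(p, c) for p, c in synthetic_clips if keep_clip(p, c)]
```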
arXiv Detail & Related papers (2024-09-19T13:07:55Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
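A toy prompt-tuning sketch, not SpeechPrompt itself: learnable prompt vectors are prepended to a discrete-unit sequence while the pretrained unit LM stays frozen. The assumption that `unit_lm` accepts input embeddings directly, and all names and sizes, are illustrative.

```python
# Prompt tuning over a frozen unit language model: only the prompt
# embeddings receive gradients; the backbone stays untouched.
import torch
import torch.nn as nn

class PromptedUnitLM(nn.Module):
    def __init__(self, unit_lm: nn.Module, embed: nn.Embedding, n_prompt: int = 10):
        super().__init__()
        self.unit_lm = unit_lm              # frozen pretrained speech LM
        self.embed = embed                  # its discrete-unit embedding table
        self.prompt = nn.Parameter(torch.randn(n_prompt, embed.embedding_dim) * 0.02)
        for p in self.unit_lm.parameters():
            p.requires_grad = False         # only the prompt vectors train

    def forward(self, units: torch.Tensor) -> torch.Tensor:
        x = self.embed(units)                             # (batch, seq, dim)
        prompt = self.prompt.expand(x.size(0), -1, -1)    # shared across batch
        return self.unit_lm(torch.cat([prompt, x], dim=1))
```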
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data [22.933382649048113]
We propose a new forced-alignment tool, FASA, as a flexible and automatic speech aligner to extract high-quality aligned children's speech data.
We demonstrate its usage on the CHILDES dataset and show that FASA can improve data quality by 13.6$\times$ over human annotations.
arXiv Detail & Related papers (2024-06-25T20:37:16Z)
- Children's Speech Recognition through Discrete Token Enhancement [7.964926333613502]
We investigate integrating discrete speech tokens as input to children's speech recognition systems without significantly degrading ASR performance.
Results reveal that the discrete token ASR for children achieves nearly equivalent performance with an approximate 83% reduction in parameters.
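A sketch of one common recipe for obtaining such discrete tokens, not necessarily the paper's: quantize self-supervised features with k-means. The WavLM encoder and the codebook size of 100 are illustrative assumptions.

```python
# Discretizing speech via k-means over self-supervised features.
import torch
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, WavLMModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base")
encoder = WavLMModel.from_pretrained("microsoft/wavlm-base")

def ssl_features(wave_16k: torch.Tensor) -> torch.Tensor:
    inputs = extractor(wave_16k.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state.squeeze(0)  # (frames, dim)

# Fit a codebook on training features, then tokenize new utterances;
# ten seconds of random noise stands in for real child speech here.
train_feats = ssl_features(torch.randn(16000 * 10))
codebook = KMeans(n_clusters=100, n_init=10).fit(train_feats.numpy())
tokens = codebook.predict(ssl_features(torch.randn(16000 * 10)).numpy())
```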
arXiv Detail & Related papers (2024-06-19T10:45:12Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Data augmentation using prosody and false starts to recognize non-native children's speech [12.911954427107977]
This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition.
The task is to recognize non-native speech from children of various age groups given a limited amount of speech.
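A minimal augmentation sketch, assuming sox-style speed and pitch effects as one way to mimic prosodic variation; the paper's exact recipe, which also injects false starts, is not reproduced here.

```python
# Prosody-style augmentation via sox effects in torchaudio.
import torchaudio

def prosody_augment(path: str, speed: float = 1.1, pitch_cents: int = 200):
    wav, sr = torchaudio.load(path)
    effects = [
        ["speed", str(speed)],        # speaking-rate change
        ["pitch", str(pitch_cents)],  # pitch shift in cents
        ["rate", str(sr)],            # resample back to the original rate
    ]
    return torchaudio.sox_effects.apply_effects_tensor(wav, sr, effects)
```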
arXiv Detail & Related papers (2020-08-29T05:32:32Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance.
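A minimal sketch of one widespread deep-learning recipe for enhancement: estimate a time-frequency mask with a small network and apply it to the noisy STFT. The architecture and STFT settings here are illustrative assumptions, not the survey's reference system.

```python
# Mask-based speech enhancement: predict a [0, 1] mask over the
# magnitude spectrogram, apply it to the complex STFT, and invert.
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    def __init__(self, n_freq: int = 257, hidden: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(mag)                 # mag: (batch, frames, n_freq)
        return torch.sigmoid(self.out(h))    # mask values in [0, 1]

def enhance(noisy: torch.Tensor, net: MaskNet, n_fft: int = 512, hop: int = 128):
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy, n_fft, hop, window=window,
                      return_complex=True)              # (freq, frames)
    mask = net(spec.abs().T.unsqueeze(0)).squeeze(0).T  # (freq, frames)
    return torch.istft(spec * mask, n_fft, hop, window=window,
                       length=noisy.shape[-1])
```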
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
We show that self-supervised pre-trained features can consistently improve the translation performance.
Cross-lingual transfer allows extending to a variety of languages with little or no tuning.
arXiv Detail & Related papers (2020-06-22T10:28:38Z)