Children's Speech Recognition through Discrete Token Enhancement
- URL: http://arxiv.org/abs/2406.13431v2
- Date: Mon, 24 Jun 2024 15:31:59 GMT
- Title: Children's Speech Recognition through Discrete Token Enhancement
- Authors: Vrunda N. Sukhadia, Shammur Absar Chowdhury
- Abstract summary: We investigate the integration of discrete speech tokens into children's speech recognition systems as inputs without significantly degrading ASR performance.
Results reveal that the discrete-token ASR for children achieves nearly equivalent performance with an approximately 83% reduction in parameters.
- Score: 7.964926333613502
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Children's speech recognition is considered a low-resource task, mainly due to the lack of publicly available data. There are several reasons for such data scarcity, including expensive data collection and annotation processes and data privacy concerns, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but capture both linguistic and acoustic information could be a solution to these privacy concerns. In this study, we investigate the integration of discrete speech tokens into children's speech recognition systems as inputs without significantly degrading ASR performance. Additionally, we explore single-view and multi-view strategies for creating these discrete labels. Furthermore, we test the models' generalization capabilities on unseen-domain and nativity datasets. Results reveal that the discrete-token ASR for children achieves nearly equivalent performance with an approximately 83% reduction in parameters.
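To make the pipeline concrete, here is a minimal sketch of single-view tokenization as the abstract describes it: frame-level features from a self-supervised encoder are quantized with k-means, and the resulting cluster IDs, rather than the raw waveform, become the ASR input. The encoder (HuBERT), layer index, codebook size of 500, and file paths are illustrative assumptions, not the paper's reported configuration.

```python
# Hypothetical sketch: single-view discrete tokenization via k-means over
# self-supervised features. Encoder, layer, and codebook size are assumptions.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
encoder = bundle.get_model().eval()

def ssl_features(wav_path: str) -> torch.Tensor:
    """Return frame-level features from one intermediate encoder layer."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.inference_mode():
        layers, _ = encoder.extract_features(wav)
    return layers[6].squeeze(0)  # (frames, dim); layer 7 is an arbitrary pick

# Fit one codebook on training speech; a multi-view variant would combine
# codebooks built from several encoders or layers.
train_paths = ["train_0001.wav", "train_0002.wav"]  # placeholder paths
kmeans = KMeans(n_clusters=500, n_init="auto").fit(
    torch.cat([ssl_features(p) for p in train_paths]).numpy()
)

def tokenize(wav_path: str) -> list[int]:
    """Map an utterance to a privacy-friendlier sequence of token IDs."""
    return kmeans.predict(ssl_features(wav_path).numpy()).tolist()

print(tokenize("child_utterance.wav")[:10])  # e.g. [417, 417, 88, 231, ...]
```

Since the downstream ASR then embeds only a small discrete vocabulary instead of running a heavy acoustic front end, a large share of parameters can be dropped, and the token sequence avoids shipping the speaker-identifiable waveform around.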
Related papers
- Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space [10.875499903992782]
We conduct a set of experiments around zero-shot learning with synthetic speech data for the specific task of speech commands classification.
Our results on the Google Speech Commands dataset show that a simple ASR-based filtering method can have a big impact on the quality of the generated data.
Despite the good quality of the generated speech data, we also show that synthetic and real speech can still be easily distinguishable when using self-supervised (WavLM) features.
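As a rough illustration of that filtering step (a sketch under assumptions: the recognizer, greedy CTC decoding, exact-match criterion, and file names are all placeholders rather than the paper's setup):

```python
# Hypothetical ASR-based filter: keep a synthetic clip only when an
# off-the-shelf recognizer hears the intended command.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()
labels = bundle.get_labels()  # CTC labels; '-' is blank, '|' is word boundary

def transcribe(path: str) -> str:
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.inference_mode():
        emissions, _ = model(wav)
    ids = torch.unique_consecutive(emissions[0].argmax(dim=-1))  # greedy CTC
    return "".join(labels[i] for i in ids if labels[i] != "-").replace("|", " ").strip()

synthetic = [("tts_yes_0001.wav", "YES"), ("tts_stop_0001.wav", "STOP")]  # placeholders
kept = [(path, cmd) for path, cmd in synthetic if transcribe(path) == cmd]
print(f"kept {len(kept)}/{len(synthetic)} synthetic clips")
```

Exact string match is the strictest criterion; a real pipeline might instead threshold on edit distance or recognizer confidence.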
arXiv Detail & Related papers (2024-09-19T13:07:55Z)
- FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data [22.933382649048113]
We propose a new forced-alignment tool, FASA, as a flexible and automatic speech aligner to extract high-quality aligned children's speech data.
We demonstrate its usage on the CHILDES dataset and show that FASA can improve data quality by 13.6× over human annotations.
arXiv Detail & Related papers (2024-06-25T20:37:16Z)
- SememeASR: Boosting Performance of End-to-End Speech Recognition against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge [58.979490858061745]
We introduce sememe-based semantic knowledge into speech recognition.
Our experiments show that sememe information can improve the effectiveness of speech recognition.
In addition, our further experiments show that sememe knowledge can improve the model's recognition of long-tailed data.
arXiv Detail & Related papers (2023-09-04T08:35:05Z)
- Improving Fairness and Robustness in End-to-End Speech Recognition through unsupervised clustering [49.069298478971696]
We present a privacy preserving approach to improve fairness and robustness of end-to-end ASR.
We extract utterance level embeddings using a speaker ID model trained on a public dataset.
We use cluster IDs instead of speaker utterance embeddings as extra features during model training.
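A minimal sketch of that substitution, assuming pre-computed speaker embeddings and a k-means codebook (dimensions, cluster count, and the one-hot encoding are illustrative choices):

```python
# Hypothetical sketch: coarse cluster IDs stand in for per-utterance speaker
# embeddings, so the ASR never consumes a speaker-identifying vector directly.
import numpy as np
from sklearn.cluster import KMeans

N_CLUSTERS = 50  # illustrative choice

rng = np.random.default_rng(0)
speaker_embeddings = rng.normal(size=(10_000, 192)).astype(np.float32)  # placeholder d-vectors

codebook = KMeans(n_clusters=N_CLUSTERS, n_init="auto").fit(speaker_embeddings)

def cluster_id_feature(embedding: np.ndarray) -> np.ndarray:
    """One-hot cluster ID appended to acoustic features during ASR training."""
    cid = int(codebook.predict(embedding[None, :])[0])
    one_hot = np.zeros(N_CLUSTERS, dtype=np.float32)
    one_hot[cid] = 1.0
    return one_hot

extra_feature = cluster_id_feature(speaker_embeddings[0])
```

Since many speakers share each cluster ID, the extra feature conveys broad speaker characteristics without pinpointing an individual.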
arXiv Detail & Related papers (2023-06-06T21:13:08Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles the part of the speech signal relevant to transcription from the part that is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
- Content-Context Factorized Representations for Automated Speech Recognition [12.618527387900079]
We introduce an unsupervised, encoder-agnostic method for factoring speech-encoder representations into explicit content-encoding representations and spurious context-encoding representations.
We demonstrate improved performance on standard ASR benchmarks, as well as improved performance in both real-world and artificially noisy ASR scenarios.
arXiv Detail & Related papers (2022-05-19T21:34:40Z)
- WLASL-LEX: a Dataset for Recognising Phonological Properties in American Sign Language [2.814213966364155]
We build a large-scale dataset of American Sign Language signs annotated with six different phonological properties.
We investigate whether data-driven end-to-end and feature-based approaches can be optimised to automatically recognise these properties.
arXiv Detail & Related papers (2022-03-11T17:21:24Z) - Unsupervised Automatic Speech Recognition: A Review [2.6212127510234797]
We review the research literature to identify models and ideas that could lead to fully unsupervised ASR.
The objective of the study is to identify the limitations of what can be learned from speech data alone and to understand the minimum requirements for speech recognition.
arXiv Detail & Related papers (2021-06-09T08:33:20Z) - An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
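One way to picture the attribute-selection idea (a loose sketch only: the real LR-VAE is variational and trained with disentanglement objectives, while the layer sizes, two-block latent split, and zero-masking below are assumptions):

```python
# Loose sketch: an encoder whose latent splits into attribute-sensitive blocks;
# a task "selects" attributes by zero-masking the block it must not use.
import torch
import torch.nn as nn

class PartitionedEncoder(nn.Module):
    def __init__(self, feat_dim: int = 80, id_dim: int = 16, emo_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, id_dim + emo_dim),
        )
        self.id_dim = id_dim

    def forward(self, x: torch.Tensor, keep: str) -> torch.Tensor:
        z = self.net(x)
        z_id, z_emo = z[..., : self.id_dim], z[..., self.id_dim :]
        if keep == "emotion":      # identity-free SER: drop identity nodes
            z_id = torch.zeros_like(z_id)
        elif keep == "identity":   # emotionless SV: drop emotion nodes
            z_emo = torch.zeros_like(z_emo)
        return torch.cat([z_id, z_emo], dim=-1)

enc = PartitionedEncoder()
frame = torch.randn(1, 80)                 # placeholder acoustic features
ser_input = enc(frame, keep="emotion")     # identity block zeroed out
```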
arXiv Detail & Related papers (2021-06-05T06:19:14Z) - Protecting gender and identity with disentangled speech representations [49.00162808063399]
We show that protecting gender information in speech is more effective than modelling speaker-identity information.
We present a novel way to encode gender information and disentangle two sensitive biometric identifiers.
arXiv Detail & Related papers (2021-04-22T13:31:41Z)