Common Phone: A Multilingual Dataset for Robust Acoustic Modelling
- URL: http://arxiv.org/abs/2201.05912v1
- Date: Sat, 15 Jan 2022 19:02:46 GMT
- Title: Common Phone: A Multilingual Dataset for Robust Acoustic Modelling
- Authors: Philipp Klumpp, Tomás Arias-Vergara, Paula Andrea Pérez-Toro, Elmar Nöth, and Juan Rafael Orozco-Arroyave
- Abstract summary: This work introduces Common Phone, a gender-balanced, multilingual corpus recorded from more than 76,000 contributors via Mozilla's Common Voice project.
It comprises around 116 hours of speech enriched with automatically generated phonetic segmentation.
A Wav2Vec 2.0 acoustic model was trained on Common Phone to perform phonetic symbol recognition and validate the quality of the generated phonetic annotation.
- Score: 13.930464898816652
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Current state-of-the-art acoustic models can easily comprise more than 100
million parameters. This growing complexity demands larger training datasets to
maintain a decent generalization of the final decision function. An ideal
dataset is not necessarily large in size, but large with respect to the number
of unique speakers, utilized hardware, and varying recording conditions. This
enables a machine learning model to explore as much of the domain-specific
input space as possible during parameter estimation. This work introduces
Common Phone, a gender-balanced, multilingual corpus recorded from more than
76,000 contributors via Mozilla's Common Voice project. It comprises around 116
hours of speech enriched with automatically generated phonetic segmentation. A
Wav2Vec 2.0 acoustic model was trained on Common Phone to perform
phonetic symbol recognition and validate the quality of the generated phonetic
annotation. The architecture achieved a PER of 18.1% on the entire test set,
computed with all 101 unique phonetic symbols, showing slight differences
between the individual languages. We conclude that Common Phone provides
sufficient variability and reliable phonetic annotation to help bridge the
gap between research and application of acoustic models.
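
For context, PER (phone error rate) is the phone-level analogue of word error rate: the Levenshtein distance between the predicted and reference phone sequences, normalized by the reference length. A minimal sketch of the metric in Python (not the authors' code; the IPA symbols in the example are illustrative, not taken from Common Phone):

```python
def phone_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Levenshtein distance between two phone sequences, divided by reference length."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # i deletions
    for j in range(m + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution or match
    return dist[n][m] / n

ref = ["k", "ɔ", "m", "ə", "n"]  # illustrative reference phones
hyp = ["k", "o", "m", "n"]       # one substitution, one deletion
print(f"PER = {phone_error_rate(ref, hyp):.1%}")  # PER = 40.0%
```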
Related papers
- CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing [5.466034990848432]
CUPE is a lightweight model that captures key phoneme features in just 120 milliseconds.
CUPE achieves competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages.
arXiv Detail & Related papers (2025-08-21T07:27:10Z)
- MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions [70.93364531054273]
We introduce MultiVox, the first benchmark to evaluate the ability of voice assistants to integrate spoken and visual cues.
Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features.
Our evaluation of 10 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
arXiv Detail & Related papers (2025-07-14T23:20:42Z)
- Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition [48.527630771422935]
We propose a synthetic data generation pipeline for multi-speaker conversational ASR.
We evaluate the pipeline by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings.
arXiv Detail & Related papers (2024-08-17T14:47:05Z)
- Phonetically rich corpus construction for a low-resourced language [0.0]
This paper proposes a novel approach to create a corpus with broad phonetic coverage for a low-resourced language.
Our methodology spans text dataset collection through a sentence selection algorithm based on triphone distribution.
Using our algorithm, we achieve a 55.8% higher percentage of distinct triphones for samples of similar size (a greedy selection of this kind is sketched below).
arXiv Detail & Related papers (2024-02-08T16:36:11Z)
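
Sentence selection by triphone distribution can be pictured as a greedy set-cover loop: repeatedly take the sentence that contributes the most triphones not yet covered. A minimal sketch, assuming candidates have already been converted to phone sequences by some grapheme-to-phoneme tool; the paper's actual algorithm and scoring may differ:

```python
def triphones(phones: list[str]) -> set[tuple[str, str, str]]:
    """All overlapping 3-phone windows in a phone sequence."""
    return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}

def select_sentences(candidates: list[list[str]], budget: int) -> list[int]:
    """Greedily pick up to `budget` sentences maximizing distinct-triphone coverage."""
    covered: set = set()
    chosen: list[int] = []
    remaining = set(range(len(candidates)))
    for _ in range(budget):
        # Score each remaining sentence by how many new triphones it adds.
        best = max(remaining, key=lambda i: len(triphones(candidates[i]) - covered),
                   default=None)
        if best is None or not triphones(candidates[best]) - covered:
            break  # no sentence adds anything new
        chosen.append(best)
        covered |= triphones(candidates[best])
        remaining.remove(best)
    return chosen

cands = [["a", "b", "a", "k"], ["b", "a", "k", "u"], ["a", "b", "a"]]
print(select_sentences(cands, budget=2))  # e.g. [0, 1]
```

Coverage objectives like this are monotone submodular, so the greedy loop carries the usual (1 - 1/e) approximation guarantee.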
- Natural language guidance of high-fidelity text-to-speech with synthetic annotations [13.642358232817342]
We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions.
We then apply this method to a 45k hour dataset, which we use to train a speech language model.
Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
arXiv Detail & Related papers (2024-02-02T21:29:34Z)
- Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization (sketched below) and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
arXiv Detail & Related papers (2023-03-02T07:47:18Z)
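
The random-projection quantization mentioned here (introduced as BEST-RQ) can be summarized as: project each speech frame through a frozen random matrix, snap it to the nearest entry of a frozen random codebook, and use the resulting indices as targets for BERT-style pre-training of the encoder. A toy numpy sketch with arbitrary dimensions; USM's actual configuration differs:

```python
import numpy as np

rng = np.random.default_rng(0)
frame_dim, proj_dim, codebook_size = 80, 16, 512

# Projection and codebook are randomly initialized and never trained.
projection = rng.standard_normal((frame_dim, proj_dim))
codebook = rng.standard_normal((codebook_size, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)  # unit-norm entries

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map (T, frame_dim) speech features to (T,) discrete target indices."""
    z = frames @ projection                        # (T, proj_dim)
    z /= np.linalg.norm(z, axis=1, keepdims=True)  # normalize -> cosine matching
    return np.argmax(z @ codebook.T, axis=1)       # nearest codebook entry per frame

targets = quantize(rng.standard_normal((100, frame_dim)))  # e.g. 100 log-mel frames
print(targets[:10])  # labels a masked-prediction encoder would learn to predict
```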
- Differentiable Allophone Graphs for Language-Universal Speech Recognition [77.2981317283029]
Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages.
We present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings (a toy mapping is sketched below).
We build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language.
arXiv Detail & Related papers (2021-07-24T15:09:32Z)
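
To make the phone-to-phoneme mapping concrete: a universal recognizer scores language-independent phones, and a per-language mapping aggregates those scores into phoneme posteriors. A toy sketch with a hand-written English-style mapping (the paper learns weighted, differentiable versions of such graphs; the inventories and weights below are illustrative):

```python
import numpy as np

phones = ["t", "tʰ", "ɾ", "d"]   # universal phone inventory (tiny excerpt)
phonemes = ["/t/", "/d/"]        # one language's phoneme inventory

# In American English, [t] and [tʰ] realize /t/, while the flap [ɾ]
# can realize both /t/ and /d/. Rows: phones, columns: phonemes.
mapping = np.array([
    # /t/  /d/
    [1.0, 0.0],  # t
    [1.0, 0.0],  # tʰ
    [1.0, 1.0],  # ɾ
    [0.0, 1.0],  # d
])

def phoneme_posteriors(phone_post: np.ndarray) -> np.ndarray:
    """Aggregate phone posteriors into language-specific phoneme posteriors."""
    scores = phone_post @ mapping            # sum over phones mapped to each phoneme
    return scores / scores.sum(-1, keepdims=True)

frame = np.array([0.6, 0.2, 0.15, 0.05])    # one frame's softmax over the phones
print(dict(zip(phonemes, phoneme_posteriors(frame))))  # /t/ dominates
```

The allophone-system paper listed below uses a fixed max-style aggregation; the differentiable-graph version replaces hand-written 0/1 weights with learned ones.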
- Phoneme Recognition through Fine Tuning of Phonetic Representations: a Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates (basic Allosaurus usage is shown below).
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
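
Allosaurus ships as an open-source toolkit; basic usage looks roughly like this, following its public README (check the project for the current API; the file name is a placeholder):

```python
# pip install allosaurus
from allosaurus.app import read_recognizer

model = read_recognizer()                # load the default universal phone model
print(model.recognize("utterance.wav"))  # -> a string of IPA phone symbols
```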
- Universal Phone Recognition with a Multilingual Allophone System [135.2254086165086]
We propose a joint model of language-independent phone and language-dependent phoneme distributions.
In multilingual ASR experiments over 11 languages, we find that this model improves test performance by 2% absolute phoneme error rate.
Our recognizer achieves phone accuracy improvements of more than 17%, moving a step closer to speech recognition for all languages in the world.
arXiv Detail & Related papers (2020-02-26T21:28:57Z)
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.