Related papers: The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language

URL: http://arxiv.org/abs/2311.08323v2
Date: Mon, 1 Apr 2024 23:10:59 GMT
Title: The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language
Authors: Jian Zhu, Changbing Yang, Farhan Samir, Jahurul Islam,
Abstract summary: We show that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages. We propose CLAP-IPA, a multi-lingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences.
Score: 7.0944623704102625
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this project, we demonstrate that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages. We curated the IPAPACK, a massively multilingual speech corpus with phonemic transcriptions, encompassing more than 115 languages from diverse language families, selectively checked by linguists. Based on the IPAPACK, we propose CLAP-IPA, a multi-lingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences. The proposed model was tested on 95 unseen languages, showing strong generalizability across languages. Temporal alignments between phonemes and speech signals also emerged from contrastive training, enabling zero-shot forced alignment in unseen languages. We further introduced a neural forced aligner, IPA-ALIGNER, by finetuning CLAP-IPA with the Forward-Sum loss to learn better phone-to-audio alignment. Evaluation results suggest that IPA-ALIGNER can generalize to unseen languages without adaptation.
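
Illustrative sketch (not from the paper): the abstract describes contrastive training that pairs speech signals with phonemic sequences, then open-vocabulary matching between the two. The snippet below is a minimal NumPy sketch of that general idea, assuming a CLIP-style symmetric cross-entropy objective over a batch of paired embeddings. The function names, the temperature value of 0.07, and the use of precomputed fixed-size embeddings are all assumptions made for illustration; the listing does not state the loss or the encoders CLAP-IPA actually uses.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(speech_emb, phoneme_emb, temperature=0.07):
    """Hypothetical CLIP-style loss: row i of each matrix is a matched pair.

    speech_emb, phoneme_emb: arrays of shape (batch, dim).
    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    s = l2_normalize(speech_emb)
    p = l2_normalize(phoneme_emb)
    logits = s @ p.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Speech-to-phoneme direction: each row should pick its own column.
    loss_s2p = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Phoneme-to-speech direction: each column should pick its own row.
    loss_p2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2p + loss_p2s)


def spot_keyword(speech_emb, candidate_phoneme_embs):
    """Return the index of the candidate phoneme sequence closest to the speech.

    speech_emb: array of shape (dim,).
    candidate_phoneme_embs: array of shape (num_candidates, dim).
    """
    sims = l2_normalize(candidate_phoneme_embs) @ l2_normalize(speech_emb)
    return int(np.argmax(sims)), sims
```

In this sketch, keyword spotting reduces to a nearest-neighbour lookup in the shared embedding space, which is why the candidate set can include phonemic sequences that were never seen in training.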

Related papers

Cross-Lingual IPA Contrastive Learning for Zero-Shot NER [7.788300011344196]
We investigate how reducing the phonemic representation gap in IPA transcription enables models trained on high-resource languages to perform effectively on low-resource languages. Our proposed dataset and methodology demonstrate a substantial average gain when compared to the best performing baseline.
arXiv Detail & Related papers (2025-03-10T11:52:33Z)
Universal Automatic Phonetic Transcription into the International Phonetic Alphabet [21.000425416084706]
We present a state-of-the-art model for transcribing speech in any language into the International Phonetic Alphabet (IPA). Our model is based on wav2vec 2.0 and is fine-tuned to predict IPA from audio input. We show that the quality of our universal speech-to-IPA models is close to that of human annotators.
arXiv Detail & Related papers (2023-08-07T21:29:51Z)
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
Revisiting IPA-based Cross-lingual Text-to-speech [11.010299086810994]
The International Phonetic Alphabet (IPA) has been widely used in cross-lingual text-to-speech (TTS) to achieve cross-lingual voice cloning (CL VC). In this paper, we report some empirical findings of building a cross-lingual TTS model using IPA as inputs. Experiments show that the way to process the IPA and suprasegmental sequence has a negligible impact on the CL VC performance.
arXiv Detail & Related papers (2021-10-14T07:22:23Z)
Differentiable Allophone Graphs for Language-Universal Speech Recognition [77.2981317283029]
Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages. We present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings. We build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language.
arXiv Detail & Related papers (2021-07-24T15:09:32Z)
Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models. We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
arXiv Detail & Related papers (2020-08-06T18:25:18Z)
That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model. We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting. Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
AlloVera: A Multilingual Allophone Database [137.3686036294502]
AlloVera provides mappings from 218 allophones to phonemes for 14 languages. We show that a "universal" allophone model, Allosaurus, built with AlloVera, outperforms "universal" phonemic models and language-specific models on a speech-transcription task.
arXiv Detail & Related papers (2020-04-17T02:02:18Z)
Universal Phone Recognition with a Multilingual Allophone System [135.2254086165086]
We propose a joint model of language-independent phone and language-dependent phoneme distributions. In multilingual ASR experiments over 11 languages, we find that this model improves testing performance by 2% phoneme error rate absolute. Our recognizer achieves phone accuracy improvements of more than 17%, moving a step closer to speech recognition for all languages in the world.
arXiv Detail & Related papers (2020-02-26T21:28:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.