ZIPA: A family of efficient models for multilingual phone recognition
- URL: http://arxiv.org/abs/2505.23170v1
- Date: Thu, 29 May 2025 07:08:23 GMT
- Title: ZIPA: A family of efficient models for multilingual phone recognition
- Authors: Jian Zhu, Farhan Samir, Eleanor Chodroff, David R. Mortensen
- Abstract summary: ZIPA is a family of efficient speech models that advances the state-of-the-art performance of crosslinguistic phone recognition. We first curated IPAPack++, a large-scale multilingual speech corpus with 17,132 hours of normalized phone transcriptions.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present ZIPA, a family of efficient speech models that advances the state-of-the-art performance of crosslinguistic phone recognition. We first curated IPAPack++, a large-scale multilingual speech corpus with 17,132 hours of normalized phone transcriptions and a novel evaluation set capturing unseen languages and sociophonetic variation. With the large-scale training data, the ZIPA models, including transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverage efficient Zipformer backbones and outperform existing phone recognition systems with far fewer parameters. Further scaling via noisy student training on 11,000 hours of pseudo-labeled multilingual data yields additional improvements. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.
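Phone recognition systems like those above are conventionally scored with phone error rate (PER): the Levenshtein distance between the predicted and reference phone sequences, normalized by reference length. A minimal sketch of that metric (the function name and example sequences are illustrative, not taken from the paper):

```python
def phone_error_rate(ref, hyp):
    """Levenshtein edit distance between two phone sequences,
    normalized by the reference length (standard PER)."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[m][n] / max(m, 1)

# One substituted phone in a four-phone reference -> PER 0.25
print(phone_error_rate(["t", "e", "s", "t"], ["t", "e", "ʃ", "t"]))  # 0.25
```

Crosslinguistic benchmarks typically report PER averaged over utterances or languages, which is why a shared, normalized phone inventory (as in IPAPack++) matters for comparability.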
Related papers
- Cross-Lingual IPA Contrastive Learning for Zero-Shot NER [7.788300011344196]
We investigate how reducing the phonemic representation gap in IPA transcription enables models trained on high-resource languages to perform effectively on low-resource languages. Our proposed dataset and methodology demonstrate a substantial average gain over the best-performing baseline.
arXiv Detail & Related papers (2025-03-10T11:52:33Z)
- PolyIPA -- Multilingual Phoneme-to-Grapheme Conversion Model [0.0]
PolyIPA is a novel multilingual phoneme-to-grapheme conversion model designed for multilingual name transliteration. Two helper models are developed for data augmentation: IPA2vec for finding soundalikes across languages, and similarIPA for handling phonetic notation variations. The model achieves a mean Character Error Rate of 0.055 and a character-level BLEU score of 0.914, with particularly strong performance on languages with shallow orthographies.
arXiv Detail & Related papers (2024-12-12T09:29:59Z)
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision [16.992058149317753]
This paper explores pretraining with weakly phonetic supervision for data-efficient multilingual and crosslingual automatic speech recognition (MCL-ASR). We relax the requirement of gold-standard human-validated phonetic transcripts and obtain International Phonetic Alphabet (IPA) based transcriptions by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. Experiments demonstrate the advantages of phoneme-based models for MCL-ASR in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency.
arXiv Detail & Related papers (2024-06-04T09:56:05Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language [7.0944623704102625]
We show that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages.
We propose CLAP-IPA, a multi-lingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences.
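The open-vocabulary matching idea behind contrastive models of this kind is that a speech encoder and a phoneme-sequence encoder map into a shared embedding space, where candidates are ranked by cosine similarity. A toy sketch of that scoring step, with hand-made vectors standing in for real encoder outputs (all names and values here are illustrative assumptions, not the paper's API):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def best_match(speech_emb, candidate_embs):
    """Index of the phoneme-sequence embedding closest to the speech embedding."""
    return max(range(len(candidate_embs)),
               key=lambda i: cosine(speech_emb, candidate_embs[i]))

# Toy embeddings: candidate 0 nearly aligns with the speech vector,
# candidate 1 is almost orthogonal to it.
speech = [1.0, 0.1, 0.0]
candidates = [[0.9, 0.1, 0.0],   # e.g. embedding of the matching phone sequence
              [0.0, 1.0, 0.0]]   # e.g. embedding of an unrelated sequence
print(best_match(speech, candidates))  # 0
```

Because any phoneme string can be embedded and scored this way, the matcher is not restricted to a fixed keyword vocabulary.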
arXiv Detail & Related papers (2023-11-14T17:09:07Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding [55.989376102986654]
This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech problem under the few-shot setting.
We propose a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space.
arXiv Detail & Related papers (2022-06-27T11:24:40Z)
- Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition [71.49308685090324]
This paper investigates the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language.
We find that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
arXiv Detail & Related papers (2022-01-26T22:12:55Z)
- Revisiting IPA-based Cross-lingual Text-to-speech [11.010299086810994]
The International Phonetic Alphabet (IPA) has been widely used in cross-lingual text-to-speech (TTS) to achieve cross-lingual voice cloning (CL VC).
In this paper, we report some empirical findings of building a cross-lingual TTS model using IPA as inputs.
Experiments show that the way to process the IPA and suprasegmental sequence has a negligible impact on the CL VC performance.
arXiv Detail & Related papers (2021-10-14T07:22:23Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
- Universal Phone Recognition with a Multilingual Allophone System [135.2254086165086]
We propose a joint model of language-independent phone and language-dependent phoneme distributions.
In multilingual ASR experiments over 11 languages, we find that this model improves testing performance by 2% phoneme error rate absolute.
Our recognizer achieves phone accuracy improvements of more than 17%, moving a step closer to speech recognition for all languages in the world.
arXiv Detail & Related papers (2020-02-26T21:28:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.