PolyIPA -- Multilingual Phoneme-to-Grapheme Conversion Model
- URL: http://arxiv.org/abs/2412.09102v1
- Date: Thu, 12 Dec 2024 09:29:59 GMT
- Title: PolyIPA -- Multilingual Phoneme-to-Grapheme Conversion Model
- Authors: Davor Lauc
- Abstract summary: PolyIPA is a novel multilingual phoneme-to-grapheme conversion model designed for multilingual name transliteration. Two helper models are developed for data augmentation: IPA2vec for finding soundalikes across languages, and similarIPA for handling phonetic notation variations. The model achieves a mean Character Error Rate of 0.055 and a character-level BLEU score of 0.914, with particularly strong performance on languages with shallow orthographies.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents PolyIPA, a novel multilingual phoneme-to-grapheme conversion model designed for multilingual name transliteration, onomastic research, and information retrieval. The model leverages two helper models developed for data augmentation: IPA2vec for finding soundalikes across languages, and similarIPA for handling phonetic notation variations. Evaluated on a test set that spans multiple languages and writing systems, the model achieves a mean Character Error Rate of 0.055 and a character-level BLEU score of 0.914, with particularly strong performance on languages with shallow orthographies. The implementation of beam search further improves practical utility, with top-3 candidates reducing the effective error rate by 52.7% (to CER: 0.026), demonstrating the model's effectiveness for cross-linguistic applications.
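The top-3 improvement is easy to make concrete: under top-k decoding, an input counts as handled correctly when the best of the k beam candidates is, so the effective CER is the minimum CER across candidates. A minimal Python sketch follows; the example beam output is invented for illustration, and only the 0.055, 0.026, and 52.7% figures come from the abstract.

```python
# Sketch of best-of-k CER under beam search. The candidate list below is a
# hypothetical beam output; only the final arithmetic check uses figures
# reported in the abstract (0.055 -> 0.026, a 52.7% relative reduction).

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    return edit_distance(hypothesis, reference) / max(len(reference), 1)

def best_of_k_cer(candidates: list[str], reference: str) -> float:
    """Effective CER when any of the top-k beam candidates may be accepted."""
    return min(cer(c, reference) for c in candidates)

# Hypothetical top-3 beam output for the phonemes of "John":
print(best_of_k_cer(["Jon", "John", "Jhon"], "John"))  # 0.0: correct form is in the beam

# Consistency check of the reported relative reduction:
print(round((0.055 - 0.026) / 0.055 * 100, 1))  # 52.7
```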
Related papers
- Cross-Lingual IPA Contrastive Learning for Zero-Shot NER
We investigate how reducing the phonemic representation gap in IPA transcription enables models trained on high-resource languages to perform effectively on low-resource languages.
Our proposed dataset and methodology demonstrate a substantial average gain when compared to the best-performing baseline.
arXiv Detail & Related papers (2025-03-10T11:52:33Z)
- PolyPrompt: Automating Knowledge Extraction from Multilingual Language Models with Dynamic Prompt Generation
We introduce PolyPrompt, a novel, parameter-efficient framework for enhancing the multilingual capabilities of large language models (LLMs).
Our method learns a set of trigger tokens for each language through a gradient-based search, identifying the input query's language and selecting the corresponding trigger tokens, which are prepended to the prompt during inference (see the sketch below).
We perform experiments on two 1-billion-parameter models, with evaluations on the global MMLU benchmark across fifteen typologically and resource-diverse languages, demonstrating accuracy gains of 3.7%-19.9% compared to naive and translation-pipeline baselines.
arXiv Detail & Related papers (2025-02-27T04:41:22Z)
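At inference time, the PolyPrompt summary above amounts to: detect the query's language, then prepend that language's trigger tokens. A rough sketch of this control flow, where the trigger strings and the detect_language helper are placeholders (the paper learns trigger tokens by gradient-based search, not as fixed strings):

```python
# Illustrative control flow for language-conditioned trigger tokens.
# TRIGGER_TOKENS values and detect_language are hypothetical stand-ins.

TRIGGER_TOKENS = {
    "en": ["<en_0>", "<en_1>", "<en_2>"],
    "hr": ["<hr_0>", "<hr_1>", "<hr_2>"],
}

def detect_language(query: str) -> str:
    """Placeholder: the real framework identifies the input query's language."""
    return "hr" if any(ch in "čćđšž" for ch in query) else "en"

def build_prompt(query: str) -> str:
    lang = detect_language(query)
    # Prepend the language-specific trigger tokens before inference.
    return " ".join(TRIGGER_TOKENS[lang] + [query])

print(build_prompt("Koji je glavni grad Hrvatske?"))
```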
- Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset
We introduce a statistical test, the Preference Proportion Test, for identifying unreliable language subsets.
We find that filtering out this low-quality data when training models for the downstream task of phonetic transcription brings substantial benefits (one plausible form of the test is sketched below).
arXiv Detail & Related papers (2024-10-05T21:41:49Z)
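The summary names the Preference Proportion Test but gives no definition. One plausible instantiation, assumed here purely for illustration, is a one-sided binomial test on pairwise annotator judgments: if the proportion of judgments preferring a subset's original labels over re-annotations is significantly below chance, the subset is flagged.

```python
# Hypothetical instantiation of a preference-proportion style test; the
# actual definition in the paper may differ.
from scipy.stats import binomtest

preferred_original = 4   # judgments favoring the subset's existing labels
total_judgments = 20

result = binomtest(preferred_original, total_judgments, p=0.5, alternative="less")
if result.pvalue < 0.05:
    print(f"flag subset as unreliable (p = {result.pvalue:.4f})")
```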
- A New Method for Cross-Lingual-based Semantic Role Labeling
A deep learning algorithm is proposed to train semantic role labeling models for English and Persian.
The results show significant improvements compared to Niksirt et al.'s model.
The development of cross-lingual methods for semantic role labeling holds promise.
arXiv Detail & Related papers (2024-08-28T16:06:12Z)
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models.
We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
arXiv Detail & Related papers (2023-11-14T00:43:33Z)
- Multilingual and Multi-topical Benchmark of Fine-tuned Language models and Large Language Models for Check-Worthy Claim Detection
This study compares the performance of (1) fine-tuned language models and (2) large language models on the task of check-worthy claim detection.
We composed a multilingual and multi-topical dataset comprising texts of various sources and styles.
arXiv Detail & Related papers (2023-11-10T15:36:35Z)
- Scaling Up Deliberation for Multilingual ASR
We investigate second-pass deliberation for multilingual speech recognition.
Our proposed deliberation is multilingual, i.e., the text encoder encodes hypothesis text from multiple languages, and the decoder attends to multilingual text and audio (see the schematic below).
We show that deliberation improves the average WER on 9 languages by 4% relative compared to the single-pass model.
arXiv Detail & Related papers (2022-10-11T21:07:00Z)
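The deliberation step described above is a second pass whose decoder attends jointly to the first-pass hypothesis text and to the audio. A schematic of that joint attention, with shapes and plain dot-product attention assumed for illustration (the actual model is a trained sequence-to-sequence system):

```python
# Schematic joint attention over hypothesis text and audio encodings.
import numpy as np

def attention(query: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Single-head dot-product attention: query (d,), memory (n, d)."""
    scores = memory @ query / np.sqrt(query.size)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory

rng = np.random.default_rng(0)
text_enc = rng.normal(size=(12, 16))   # encoded first-pass hypothesis tokens
audio_enc = rng.normal(size=(50, 16))  # encoded audio frames
state = rng.normal(size=16)            # current decoder state

# The deliberation decoder conditions on both sources at every step.
context = np.concatenate([attention(state, text_enc),
                          attention(state, audio_enc)])
print(context.shape)  # (32,): joint text+audio context vector
```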
- OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves state-of-the-art results.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z)
- Few-shot Learning with Multilingual Language Models
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets a new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
- Language Detection Engine for Multilingual Texting on Mobile Devices
More than 2 billion mobile users worldwide type in multiple languages on soft keyboards.
On a monolingual keyboard, 38% of falsely auto-corrected words are valid in another language.
We present a fast, lightweight, and accurate Language Detection Engine (LDE) for multilingual typing (see the sketch below).
arXiv Detail & Related papers (2021-01-07T16:49:47Z)
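The summary does not describe the engine's internals; a generic character-bigram scorer (an assumption about the kind of lightweight model such engines build on) shows why detection can be both fast and small:

```python
# Generic character-bigram language scorer; profiles and scoring are
# illustrative, not the LDE's actual design.
from collections import Counter

def bigrams(text: str) -> Counter:
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

# Toy per-language profiles; a real engine trains these on large corpora.
PROFILES = {
    "en": bigrams("the quick brown fox jumps over the lazy dog"),
    "de": bigrams("der schnelle braune fuchs springt ueber den faulen hund"),
}

def detect(text: str) -> str:
    query = bigrams(text)
    # Score each language by bigram-count overlap with its profile.
    def score(profile: Counter) -> int:
        return sum(min(c, profile[g]) for g, c in query.items())
    return max(PROFILES, key=lambda lang: score(PROFILES[lang]))

print(detect("springt der hund"))  # de
```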
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities (a schematic sentence-level variant is sketched below).
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
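The exact form of AMBER's objectives is not given in the summary. A schematic sentence-level alignment loss, assuming the common contrastive formulation that pulls parallel sentence embeddings together against in-batch negatives (not necessarily AMBER's exact loss):

```python
# Schematic contrastive sentence-alignment objective; an assumed stand-in,
# not AMBER's published loss.
import numpy as np

def sentence_alignment_loss(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """src_emb, tgt_emb: (batch, dim) embeddings of parallel sentence pairs."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # cosine similarities; true pairs lie on the diagonal
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
print(sentence_alignment_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
```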