PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech
- URL: http://arxiv.org/abs/2511.03080v1
- Date: Wed, 05 Nov 2025 00:06:35 GMT
- Title: PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech
- Authors: Michel Wong, Ali Alshehri, Sophia Kao, Haotian He
- Abstract summary: Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems. We propose PolyNorm, a prompt-based approach to TN using Large Language Models (LLMs). We present a language-agnostic pipeline for automatic data curation and evaluation, designed to facilitate scalable experimentation across diverse languages.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems, converting written forms into their canonical spoken equivalents. Traditional TN systems can exhibit high accuracy, but involve substantial engineering effort, are difficult to scale, and pose challenges to language coverage, particularly in low-resource settings. We propose PolyNorm, a prompt-based approach to TN using Large Language Models (LLMs), aiming to reduce the reliance on manually crafted rules and enable broader linguistic applicability with minimal human intervention. Additionally, we present a language-agnostic pipeline for automatic data curation and evaluation, designed to facilitate scalable experimentation across diverse languages. Experiments across eight languages show consistent reductions in the word error rate (WER) compared to a production-grade rule-based system. To support further research, we release PolyNorm-Benchmark, a multilingual dataset covering a diverse range of text normalization phenomena.
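The prompt-based approach the abstract describes can be illustrated with a minimal sketch. The template and the written/spoken example pairs below are hypothetical (the paper's actual prompts, examples, and model are not given here); the sketch only shows the few-shot pattern: a short instruction, a handful of written-to-spoken demonstrations, then the input to normalize.

```python
# Minimal sketch of few-shot prompt construction for text normalization.
# The instruction, example pairs, and format are illustrative assumptions,
# not PolyNorm's actual prompts.

FEW_SHOT_EXAMPLES = [
    ("Dr. Smith lives at 221B Baker St.",
     "Doctor Smith lives at two twenty one B Baker Street"),
    ("The meeting is at 3:30 PM.",
     "The meeting is at three thirty P M"),
    ("It costs $4.99.",
     "It costs four dollars and ninety nine cents"),
]

def build_tn_prompt(written: str) -> str:
    """Assemble a few-shot prompt: written/spoken pairs, then the new input."""
    lines = ["Convert each written sentence to its spoken form."]
    for w, s in FEW_SHOT_EXAMPLES:
        lines.append(f"Written: {w}")
        lines.append(f"Spoken: {s}")
    lines.append(f"Written: {written}")
    lines.append("Spoken:")  # the LLM completes this line
    return "\n".join(lines)

print(build_tn_prompt("Call me at 5 PM."))
```

The resulting string would be sent to whatever LLM backs the system; the model's completion after the final `Spoken:` line is taken as the normalized output.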
Related papers
- SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution
Intra-sentence multilingual speech synthesis (code-switching TTS) remains a major challenge due to abrupt language shifts, varied scripts, and mismatched prosody between languages. We introduce Script-First Multilingual Synthesis with Adaptive Locale Resolution (SFMS-ALR), an engine-agnostic framework for fluent, real-time code-switched speech generation.
arXiv Detail & Related papers (2025-10-27T21:39:07Z)
- SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision
Sign language translation (SLT) is typically trained with text in a single spoken language. We employ language-agnostic, multimodal embeddings trained on text and speech from multiple languages to supervise SLT. Our results demonstrate that language-agnostic embedding supervision, combined with coupled augmentation, provides a scalable and semantically robust alternative to traditional SLT training.
arXiv Detail & Related papers (2025-10-22T09:17:31Z)
- Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach
Autoregressive language models are vulnerable to orthographic attacks. This vulnerability stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. We propose a pixel-based generative language model that replaces text-based embeddings with pixel-based representations by rendering words as individual images.
arXiv Detail & Related papers (2025-08-28T20:48:38Z)
- Evaluation of NMT-Assisted Grammar Transfer for a Multi-Language Configurable Data-to-Text System
One approach for multilingual data-to-text generation is to translate grammatical configurations upfront from the source language into each target language. In this paper, we describe a rule-based NLG implementation where the configuration is translated by Neural Machine Translation (NMT) combined with a one-time human review. Our evaluation on the SportSett:Basketball dataset shows that our NLG system performs well, underlining its grammatical correctness in translation tasks.
arXiv Detail & Related papers (2025-01-27T15:25:26Z)
- Zero-shot Cross-lingual Transfer of Prompt-based Tuning with a Unified Multilingual Prompt
We propose a novel model that uses a unified prompt for all languages, called UniPrompt.
The unified prompt is computed by a multilingual PLM to produce a language-independent representation.
Our proposed methods can significantly outperform the strong baselines across different languages.
arXiv Detail & Related papers (2022-02-23T11:57:52Z)
- To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP
We investigate three categories of text augmentation methodologies that perform changes on the syntax.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
arXiv Detail & Related papers (2021-11-18T10:52:48Z)
- Evaluating the Morphosyntactic Well-formedness of Generated Texts
We propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
arXiv Detail & Related papers (2021-03-30T18:02:58Z)
- SML: a new Semantic Embedding Alignment Transformer for efficient cross-lingual Natural Language Inference
The ability of Transformers to perform a variety of tasks with precision, such as question answering, Natural Language Inference (NLI), or summarization, has enabled them to rank among the best paradigms for addressing such tasks at present.
NLI is one of the best scenarios for testing these architectures, given the knowledge required to understand complex sentences and establish a relation between a hypothesis and a premise.
In this paper, we propose a new architecture, siamese multilingual transformer, to efficiently align multilingual embeddings for Natural Language Inference.
arXiv Detail & Related papers (2021-03-17T13:23:53Z)
- Neural Inverse Text Normalization
We propose an efficient and robust neural solution for inverse text normalization.
We show that this can be easily extended to other languages without the need for a linguistic expert to manually curate them.
A transformer-based model infused with pretraining consistently achieves a lower WER across several datasets.
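WER, the metric used both here and in the PolyNorm abstract above, is the word-level Levenshtein distance between hypothesis and reference, divided by the number of reference words. A minimal sketch of the standard computation (not taken from either paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))  # distance from empty reference
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i       # prev holds d[i-1][0]
        for j, h in enumerate(hyp, 1):
            cur = d[j]             # d[i-1][j], not yet overwritten
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution or match
            prev = cur
    return d[-1] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat"))  # 0.0
```

Insertions, deletions, and substitutions are weighted equally here; production evaluations often also normalize case and punctuation before scoring.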
arXiv Detail & Related papers (2021-02-12T07:53:53Z)
- SDA: Improving Text Generation with Self Data Augmentation
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Few-shot Natural Language Generation for Task-Oriented Dialog
We present FewShotWoz, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems.
We develop the SC-GPT model, which is pre-trained on a large annotated NLG corpus to acquire controllable generation ability.
Experiments on FewShotWoz and the large Multi-Domain-WOZ datasets show that the proposed SC-GPT significantly outperforms existing methods.
arXiv Detail & Related papers (2020-02-27T18:48:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences arising from its use.