Language Agnostic Data-Driven Inverse Text Normalization
- URL: http://arxiv.org/abs/2301.08506v2
- Date: Tue, 24 Jan 2023 00:48:14 GMT
- Title: Language Agnostic Data-Driven Inverse Text Normalization
- Authors: Szu-Jui Chen, Debjyoti Paul, Yutong Pang, Peng Su, Xuedong Zhang
- Abstract summary: inverse text normalization (ITN) problem attracts the attention of researchers from various fields.
Due to the scarcity of labeled spoken-written datasets, the studies on non-English data-driven ITN are quite limited.
We propose a language-agnostic data-driven ITN framework to fill this gap.
- Score: 6.43601166279978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the emergence of automatic speech recognition (ASR) models, converting
the spoken form text (from ASR) to the written form is in urgent need. This
inverse text normalization (ITN) problem attracts the attention of researchers
from various fields. Recently, several works show that data-driven ITN methods
can output high-quality written form text. Due to the scarcity of labeled
spoken-written datasets, the studies on non-English data-driven ITN are quite
limited. In this work, we propose a language-agnostic data-driven ITN framework
to fill this gap. Specifically, we leverage the data augmentation in
conjunction with neural machine translated data for low resource languages.
Moreover, we design an evaluation method for language agnostic ITN model when
only English data is available. Our empirical evaluation shows this language
agnostic modeling approach is effective for low resource languages while
preserving the performance for high resource languages.
Related papers
- Unsupervised Data Validation Methods for Efficient Model Training [0.0]
State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT) and vision-language models (VLM) rely heavily on large datasets.
This research explores key areas such as defining "quality data," developing methods for generating appropriate data and enhancing accessibility to model training.
arXiv Detail & Related papers (2024-10-10T13:00:53Z) - Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation [2.9921619703037274]
We propose a retrieval augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing.
We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM.
We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages.
arXiv Detail & Related papers (2024-10-01T04:20:14Z) - TEXTRON: Weakly Supervised Multilingual Text Detection through Data
Programming [21.88026116276415]
Text detection is a challenging problem in the field of computer vision (CV)
There is a scarcity of word-level labeled data for text detection, especially for multilingual settings and Indian scripts.
We propose TEXTRON, a Data Programming-based approach, where users can plug various text detection methods into a weak supervision-based learning framework.
arXiv Detail & Related papers (2024-02-15T09:18:18Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - Deepfake audio as a data augmentation technique for training automatic
speech to text transcription models [55.2480439325792]
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset produced by Indians (in English) was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z) - XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, Transformer-Transducer (T-T)
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with
Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
arXiv Detail & Related papers (2023-01-30T00:53:50Z) - Reinforced Iterative Knowledge Distillation for Cross-Lingual Named
Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource language to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z) - Bootstrap an end-to-end ASR system by multilingual training, transfer
learning, text-to-text mapping and synthetic audio [8.510792628268824]
bootstrapping speech recognition on limited data resources has been an area of active research for long.
We investigate here the effectiveness of different strategies to bootstrap an RNN-Transducer based automatic speech recognition (ASR) system in the low resource regime.
Our experiments demonstrate that transfer learning from a multilingual model, using a post-ASR text-to-text mapping and synthetic audio deliver additive improvements.
arXiv Detail & Related papers (2020-11-25T13:11:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.