A Chat About Boring Problems: Studying GPT-based text normalization
- URL: http://arxiv.org/abs/2309.13426v2
- Date: Wed, 17 Jan 2024 16:36:58 GMT
- Title: A Chat About Boring Problems: Studying GPT-based text normalization
- Authors: Yang Zhang, Travis M. Bartley, Mariana Graterol-Fuenmayor, Vitaly
Lavrukhin, Evelina Bakhturina, Boris Ginsburg
- Abstract summary: We show the capacity of Large Language Models (LLMs) for text normalization in few-shot scenarios.
We find LLM-based text normalization to achieve error rates around 40% lower than top normalization systems.
We create a new taxonomy of text normalization errors and apply it to results from GPT-3.5-Turbo and GPT-4.0.
- Score: 22.64840464909988
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text normalization - the conversion of text from written to spoken form - is
traditionally assumed to be an ill-formed task for language models. In this
work, we argue otherwise. We empirically show the capacity of Large Language
Models (LLMs) for text normalization in few-shot scenarios. Combining
self-consistency reasoning with linguistically informed prompt engineering, we find
LLM-based text normalization to achieve error rates around 40% lower than top
normalization systems. Further, upon error analysis, we note key limitations in
the conventional design of text normalization tasks. We create a new taxonomy
of text normalization errors and apply it to results from GPT-3.5-Turbo and
GPT-4.0. Through this new framework, we can identify strengths and weaknesses
of GPT-based TN, opening opportunities for future work.
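As a rough illustration of the recipe in the abstract, the sketch below combines a few-shot prompt with self-consistency: several normalizations are sampled from an LLM and the majority answer wins. The prompt text and the sample_llm helper are hypothetical stand-ins, not material from the paper.

```python
from collections import Counter

# Hypothetical few-shot prompt; the example pair is illustrative,
# not taken from the paper's actual prompts.
FEW_SHOT_PROMPT = """Normalize written text into its spoken form.
written: Dr. Smith paid $123 on Jan. 5.
spoken: doctor smith paid one hundred twenty three dollars on january fifth
written: {text}
spoken:"""

def sample_llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in for one chat-completion call
    (e.g. to GPT-3.5-Turbo); swap in a real API client."""
    raise NotImplementedError

def normalize_with_self_consistency(text: str, n_samples: int = 5) -> str:
    # Sample several candidates at nonzero temperature, then keep
    # the normalization the model produces most often.
    prompt = FEW_SHOT_PROMPT.format(text=text)
    candidates = [sample_llm(prompt) for _ in range(n_samples)]
    return Counter(candidates).most_common(1)[0][0]
```

Majority voting over sampled outputs is what makes the nonzero temperature safe: disagreements between samples tend to cancel out, while the consistent answer survives.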
Related papers
- Historical German Text Normalization Using Type- and Token-Based Language Modeling [0.0]
This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus.
The system combines a Transformer encoder-decoder model that normalizes individual word types with a pre-trained causal language model that adjusts these normalizations within their sentence context (a toy sketch follows this entry).
An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable to that of a much larger, fully end-to-end sentence-based normalization system built by fine-tuning a pre-trained Transformer large language model.
arXiv Detail & Related papers (2024-09-04T16:14:05Z)
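The two-stage design described above (type-level normalization, then contextual adjustment) can be sketched as follows. normalize_type and lm_score are toy stand-ins for the paper's encoder-decoder and causal language model, not its actual code.

```python
def normalize_type(word: str) -> list[str]:
    """Stand-in for the encoder-decoder: candidate modern spellings
    for one historical word type (toy lexicon, not real output)."""
    lexicon = {"vnnd": ["und"], "seyn": ["sein", "seien"]}
    return lexicon.get(word, [word])

def lm_score(sentence: str) -> float:
    """Stand-in for a causal LM: higher = more fluent."""
    return -len(sentence)  # toy proxy, NOT a real language model

def normalize_sentence(words: list[str]) -> list[str]:
    out: list[str] = []
    for i, word in enumerate(words):
        # Stage 1: propose candidates per word type.
        candidates = normalize_type(word)
        # Stage 2: let the (stand-in) LM pick the candidate that
        # best fits the surrounding sentence.
        best = max(candidates, key=lambda c: lm_score(
            " ".join(out + [c] + words[i + 1:])))
        out.append(best)
    return out

print(normalize_sentence(["vnnd", "seyn"]))  # ['und', 'sein'] (toy)
```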
- Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- Collaborative Generative AI: Integrating GPT-k for Efficient Editing in Text-to-Image Generation [114.80518907146792]
We investigate the potential of utilizing large-scale language models, such as GPT-k, to improve the prompt editing process for text-to-image generation.
We compare the common edits made by humans and GPT-k, evaluate the performance of GPT-k in prompting T2I, and examine factors that may influence this process.
arXiv Detail & Related papers (2023-05-18T21:53:58Z)
- Text normalization for low-resource languages: the case of Ligurian [8.27203430509479]
We show that a compact transformer-based model can be trained to achieve very low error rates through backtranslation (illustrated after this entry) and appropriate tokenization.
We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first open source monolingual corpus for Ligurian.
arXiv Detail & Related papers (2022-06-16T00:37:55Z)
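Backtranslation, as used in the Ligurian paper above, can be illustrated roughly as follows: a model trained in the reverse direction synthesizes noisy sources from clean monolingual sentences, yielding extra parallel training pairs. reverse_model is a hypothetical stand-in, not the paper's code.

```python
def reverse_model(normalized: str) -> str:
    """Hypothetical stand-in for a model trained in the reverse
    direction (normalized -> raw text); not the paper's code."""
    raise NotImplementedError

def augment_with_backtranslation(monolingual: list[str]) -> list[tuple[str, str]]:
    # Pair each clean sentence with a synthetic noisy source,
    # enlarging the parallel corpus for the forward normalizer.
    return [(reverse_model(clean), clean) for clean in monolingual]
```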
- Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization [13.929356163132558]
We propose a new hybrid approach that combines the benefits of rule-based and neural systems.
First, a non-deterministic WFST outputs all normalization candidates; a neural language model then picks the best one (a toy sketch follows this entry).
It achieves comparable or better results than existing state-of-the-art TN models.
arXiv Detail & Related papers (2022-03-29T21:34:35Z)
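A minimal sketch of the shallow-fusion recipe above: a stand-in for the non-deterministic WFST enumerates candidate verbalizations, and a stand-in language-model score picks the winner. Both components here are toy assumptions, not the paper's actual WFST or LM.

```python
def candidate_verbalizations(token: str) -> list[str]:
    """Toy stand-in for a non-deterministic WFST: every plausible
    spoken form of a written token."""
    if token == "2021":
        return ["twenty twenty one", "two thousand twenty one",
                "two thousand and twenty one"]
    return [token]

def lm_score(text: str) -> float:
    """Toy stand-in for a neural LM score (higher = more likely)."""
    return 0.0 if text == "twenty twenty one" else -1.0

def normalize(token: str) -> str:
    # Rule-based component enumerates, neural component disambiguates.
    return max(candidate_verbalizations(token), key=lm_score)

print(normalize("2021"))  # twenty twenty one
```

The division of labor is the point: the grammar guarantees every output is a legal verbalization, while the LM resolves the ambiguity the grammar cannot.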
- Sequence-to-Sequence Lexical Normalization with Multilingual Transformers [3.3302293148249125]
Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal day-to-day digital communication.
This discrepancy has led to severe performance degradation of state-of-the-art NLP models when fine-tuned on real-world data.
We propose a sentence-level sequence-to-sequence model based on mBART, which frames lexical normalization as a machine translation problem (a usage sketch follows this entry).
arXiv Detail & Related papers (2021-10-06T15:53:20Z)
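Treating lexical normalization as machine translation, as the paper above does, might look roughly like this at inference time with the Hugging Face transformers library. The checkpoint path my/mbart-lexnorm is a placeholder for a fine-tuned model, not an actual released checkpoint.

```python
# Sketch: lexical normalization as noisy->canonical "translation"
# with mBART via Hugging Face transformers. "my/mbart-lexnorm" is a
# placeholder for a fine-tuned checkpoint, not a released model.
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("my/mbart-lexnorm")
model = MBartForConditionalGeneration.from_pretrained("my/mbart-lexnorm")

noisy = "new pix comming tomoroe"
inputs = tokenizer(noisy, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# hoped-for output: "new pictures coming tomorrow"
```

Fine-tuning itself would follow the standard seq2seq training loop over (noisy, canonical) sentence pairs; only the framing as translation is specific to the paper.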
- Fine-tuning GPT-3 for Russian Text Summarization [77.34726150561087]
This paper showcases ruGPT3's ability to summarize texts, fine-tuning it on a corpus of Russian news with their corresponding human-generated summaries.
We evaluate the resulting texts with a set of metrics, showing that our solution can surpass the state-of-the-art model's performance without additional changes in architecture or loss function.
arXiv Detail & Related papers (2021-08-07T19:01:40Z)
- mT5: A massively multilingual pre-trained text-to-text transformer [60.0210636815514]
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
arXiv Detail & Related papers (2020-10-22T17:58:14Z)
- Universal Natural Language Processing with Limited Annotations: Try Few-shot Textual Entailment as a Start [125.23550801424328]
We introduce Universal Few-shot textual Entailment (UFO-Entail).
We demonstrate that this framework enables a pretrained entailment model to work well on new entailment domains in a few-shot setting.
arXiv Detail & Related papers (2020-10-06T09:50:25Z)
- Normalizing Text using Language Modelling based on Phonetics and String Similarity [0.0]
We propose a new robust model to perform text normalization.
It uses two masking strategies that replace unnormalized words in the text with their root forms (a toy sketch follows this entry).
The strategies yield accuracies of 86.7% and 83.2%, respectively, indicating the system's effectiveness at text normalization.
arXiv Detail & Related papers (2020-06-25T00:42:39Z)
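The phonetics-plus-string-similarity idea above can be illustrated with standard-library tools: vocabulary candidates are filtered by a crude phonetic key and then ranked by character-level similarity. This toy sketch only gestures at the paper's masking strategies; it does not reproduce them.

```python
import difflib

VOCAB = ["tomorrow", "coming", "pictures", "morning"]

def phonetic_key(word: str) -> str:
    # Crude phonetic key: keep the first letter, drop later vowels,
    # collapse repeated consonants (very rough, Soundex-like).
    key = word[0]
    for ch in word[1:]:
        if ch not in "aeiou" and ch != key[-1]:
            key += ch
    return key

def normalize_word(word: str) -> str:
    # Prefer vocabulary entries with a matching phonetic key, then
    # rank by character-level similarity to the input.
    pool = [v for v in VOCAB if phonetic_key(v) == phonetic_key(word)] or VOCAB
    return max(pool, key=lambda v: difflib.SequenceMatcher(None, word, v).ratio())

print(normalize_word("comming"))  # coming
print(normalize_word("tomoroe"))  # tomorrow
```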
- Towards Minimal Supervision BERT-based Grammar Error Correction [81.90356787324481]
We try to incorporate contextual information from a pre-trained language model to make better use of limited annotation and to benefit multilingual scenarios.
Results show the strong potential of Bidirectional Encoder Representations from Transformers (BERT) for the grammatical error correction task.
arXiv Detail & Related papers (2020-01-10T15:45:59Z)