A Chat About Boring Problems: Studying GPT-based text normalization
- URL: http://arxiv.org/abs/2309.13426v2
- Date: Wed, 17 Jan 2024 16:36:58 GMT
- Title: A Chat About Boring Problems: Studying GPT-based text normalization
- Authors: Yang Zhang, Travis M. Bartley, Mariana Graterol-Fuenmayor, Vitaly
Lavrukhin, Evelina Bakhturina, Boris Ginsburg
- Abstract summary: We show the capacity of Large Language Models (LLMs) for text normalization in few-shot scenarios.
We find LLM-based text normalization to achieve error rates around 40% lower than top normalization systems.
We create a new taxonomy of text normalization errors and apply it to results from GPT-3.5-Turbo and GPT-4.0.
- Score: 22.64840464909988
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text normalization - the conversion of text from written to spoken form - is
traditionally assumed to be an ill-formed task for language models. In this
work, we argue otherwise. We empirically show the capacity of Large Language
Models (LLMs) for text normalization in few-shot scenarios. Combining
self-consistency reasoning with linguistically informed prompt engineering, we find
LLM-based text normalization to achieve error rates around 40% lower than top
normalization systems. Further, upon error analysis, we note key limitations in
the conventional design of text normalization tasks. We create a new taxonomy
of text normalization errors and apply it to results from GPT-3.5-Turbo and
GPT-4.0. Through this new framework, we can identify strengths and weaknesses
of GPT-based TN, opening opportunities for future work.
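As a rough illustration of the recipe in the abstract, the sketch below combines a few-shot prompt with self-consistency: several normalizations are sampled from an LLM and the majority answer wins. The prompt text and the sample_llm helper are hypothetical stand-ins, not material from the paper.

```python
from collections import Counter

# Hypothetical few-shot prompt; the example pair is illustrative,
# not taken from the paper's actual prompts.
FEW_SHOT_PROMPT = """Normalize written text into its spoken form.
written: Dr. Smith paid $123 on Jan. 5.
spoken: doctor smith paid one hundred twenty three dollars on january fifth
written: {text}
spoken:"""

def sample_llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in for one chat-completion call
    (e.g. to GPT-3.5-Turbo); swap in a real API client."""
    raise NotImplementedError

def normalize_with_self_consistency(text: str, n_samples: int = 5) -> str:
    # Sample several candidates at nonzero temperature, then keep
    # the normalization the model produces most often.
    prompt = FEW_SHOT_PROMPT.format(text=text)
    candidates = [sample_llm(prompt) for _ in range(n_samples)]
    return Counter(candidates).most_common(1)[0][0]
```

Majority voting over sampled outputs is what makes the nonzero temperature safe: disagreements between samples tend to cancel out, while the consistent answer survives.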
Related papers
- Historical German Text Normalization Using Type- and Token-Based Language Modeling [0.0]
This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus.
The system combines a Transformer encoder-decoder model that normalizes individual word types with a pre-trained causal language model that adjusts these normalizations within their sentence context (a toy sketch follows this entry).
An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable to that of a much larger, fully end-to-end sentence-based normalization system built by fine-tuning a pre-trained Transformer large language model.
arXiv Detail & Related papers (2024-09-04T16:14:05Z)
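The two-stage design described above (type-level normalization, then contextual adjustment) can be sketched as follows. normalize_type and lm_score are toy stand-ins for the paper's encoder-decoder and causal language model, not its actual code.

```python
def normalize_type(word: str) -> list[str]:
    """Stand-in for the encoder-decoder: candidate modern spellings
    for one historical word type (toy lexicon, not real output)."""
    lexicon = {"vnnd": ["und"], "seyn": ["sein", "seien"]}
    return lexicon.get(word, [word])

def lm_score(sentence: str) -> float:
    """Stand-in for a causal LM: higher = more fluent."""
    return -len(sentence)  # toy proxy, NOT a real language model

def normalize_sentence(words: list[str]) -> list[str]:
    out: list[str] = []
    for i, word in enumerate(words):
        # Stage 1: propose candidates per word type.
        candidates = normalize_type(word)
        # Stage 2: let the (stand-in) LM pick the candidate that
        # best fits the surrounding sentence.
        best = max(candidates, key=lambda c: lm_score(
            " ".join(out + [c] + words[i + 1:])))
        out.append(best)
    return out

print(normalize_sentence(["vnnd", "seyn"]))  # ['und', 'sein'] (toy)
```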
- Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- Collaborative Generative AI: Integrating GPT-k for Efficient Editing in Text-to-Image Generation [114.80518907146792]
We investigate the potential of utilizing large-scale language models, such as GPT-k, to improve the prompt editing process for text-to-image generation.
We compare the common edits made by humans and GPT-k, evaluate the performance of GPT-k in prompting T2I, and examine factors that may influence this process.
arXiv Detail & Related papers (2023-05-18T21:53:58Z)
- Text normalization for low-resource languages: the case of Ligurian [8.27203430509479]
We show that a compact transformer-based model can be trained to achieve very low error rates through backtranslation (illustrated after this entry) and appropriate tokenization.
We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first open source monolingual corpus for Ligurian.
arXiv Detail & Related papers (2022-06-16T00:37:55Z)
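Backtranslation, as used in the Ligurian paper above, can be illustrated roughly as follows: a model trained in the reverse direction synthesizes noisy sources from clean monolingual sentences, yielding extra parallel training pairs. reverse_model is a hypothetical stand-in, not the paper's code.

```python
def reverse_model(normalized: str) -> str:
    """Hypothetical stand-in for a model trained in the reverse
    direction (normalized -> raw text); not the paper's code."""
    raise NotImplementedError

def augment_with_backtranslation(monolingual: list[str]) -> list[tuple[str, str]]:
    # Pair each clean sentence with a synthetic noisy source,
    # enlarging the parallel corpus for the forward normalizer.
    return [(reverse_model(clean), clean) for clean in monolingual]
```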
- Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization [13.929356163132558]
We propose a new hybrid approach that combines the benefits of rule-based and neural systems.
First, a non-deterministic WFST outputs all normalization candidates; a neural language model then picks the best one (a toy sketch follows this entry).
It achieves comparable or better results than existing state-of-the-art TN models.
arXiv Detail & Related papers (2022-03-29T21:34:35Z)
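A minimal sketch of the shallow-fusion recipe above: a stand-in for the non-deterministic WFST enumerates candidate verbalizations, and a stand-in language-model score picks the winner. Both components here are toy assumptions, not the paper's actual WFST or LM.

```python
def candidate_verbalizations(token: str) -> list[str]:
    """Toy stand-in for a non-deterministic WFST: every plausible
    spoken form of a written token."""
    if token == "2021":
        return ["twenty twenty one", "two thousand twenty one",
                "two thousand and twenty one"]
    return [token]

def lm_score(text: str) -> float:
    """Toy stand-in for a neural LM score (higher = more likely)."""
    return 0.0 if text == "twenty twenty one" else -1.0

def normalize(token: str) -> str:
    # Rule-based component enumerates, neural component disambiguates.
    return max(candidate_verbalizations(token), key=lm_score)

print(normalize("2021"))  # twenty twenty one
```

The division of labor is the point: the grammar guarantees every output is a legal verbalization, while the LM resolves the ambiguity the grammar cannot.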
- Sequence-to-Sequence Lexical Normalization with Multilingual Transformers [3.3302293148249125]
Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal day-to-day digital communication.
This discrepancy has led to severe performance degradation of state-of-the-art NLP models when fine-tuned on real-world data.
We propose a sentence-level sequence-to-sequence model based on mBART, which frames lexical normalization as a machine translation problem (a usage sketch follows this entry).
arXiv Detail & Related papers (2021-10-06T15:53:20Z)
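Treating lexical normalization as machine translation, as the paper above does, might look roughly like this at inference time with the Hugging Face transformers library. The checkpoint path my/mbart-lexnorm is a placeholder for a fine-tuned model, not an actual released checkpoint.

```python
# Sketch: lexical normalization as noisy->canonical "translation"
# with mBART via Hugging Face transformers. "my/mbart-lexnorm" is a
# placeholder for a fine-tuned checkpoint, not a released model.
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("my/mbart-lexnorm")
model = MBartForConditionalGeneration.from_pretrained("my/mbart-lexnorm")

noisy = "new pix comming tomoroe"
inputs = tokenizer(noisy, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# hoped-for output: "new pictures coming tomorrow"
```

Fine-tuning itself would follow the standard seq2seq training loop over (noisy, canonical) sentence pairs; only the framing as translation is specific to the paper.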
- Fine-tuning GPT-3 for Russian Text Summarization [77.34726150561087]
This paper showcases ruGPT3's ability to summarize texts, fine-tuning it on a corpus of Russian news with their corresponding human-generated summaries.
We evaluate the resulting texts with a set of metrics, showing that our solution can surpass the state-of-the-art model's performance without additional changes in architecture or loss function.
arXiv Detail & Related papers (2021-08-07T19:01:40Z)
- mT5: A massively multilingual pre-trained text-to-text transformer [60.0210636815514]
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
arXiv Detail & Related papers (2020-10-22T17:58:14Z)
- Universal Natural Language Processing with Limited Annotations: Try Few-shot Textual Entailment as a Start [125.23550801424328]
We introduce Universal Few-shot textual Entailment (UFO-Entail).
We demonstrate that this framework enables a pretrained entailment model to work well on new entailment domains in a few-shot setting.
arXiv Detail & Related papers (2020-10-06T09:50:25Z)
- Normalizing Text using Language Modelling based on Phonetics and String Similarity [0.0]
We propose a new robust model to perform text normalization.
It uses two masking strategies that replace unnormalized words in the text with their root forms (a toy sketch follows this entry).
The strategies yield accuracies of 86.7% and 83.2%, respectively, indicating the system's effectiveness at text normalization.
arXiv Detail & Related papers (2020-06-25T00:42:39Z)
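The phonetics-plus-string-similarity idea above can be illustrated with standard-library tools: vocabulary candidates are filtered by a crude phonetic key and then ranked by character-level similarity. This toy sketch only gestures at the paper's masking strategies; it does not reproduce them.

```python
import difflib

VOCAB = ["tomorrow", "coming", "pictures", "morning"]

def phonetic_key(word: str) -> str:
    # Crude phonetic key: keep the first letter, drop later vowels,
    # collapse repeated consonants (very rough, Soundex-like).
    key = word[0]
    for ch in word[1:]:
        if ch not in "aeiou" and ch != key[-1]:
            key += ch
    return key

def normalize_word(word: str) -> str:
    # Prefer vocabulary entries with a matching phonetic key, then
    # rank by character-level similarity to the input.
    pool = [v for v in VOCAB if phonetic_key(v) == phonetic_key(word)] or VOCAB
    return max(pool, key=lambda v: difflib.SequenceMatcher(None, word, v).ratio())

print(normalize_word("comming"))  # coming
print(normalize_word("tomoroe"))  # tomorrow
```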
- Towards Minimal Supervision BERT-based Grammar Error Correction [81.90356787324481]
We try to incorporate contextual information from a pre-trained language model to make better use of limited annotation and to benefit multilingual scenarios.
Results show the strong potential of Bidirectional Encoder Representations from Transformers (BERT) for the grammatical error correction task.
arXiv Detail & Related papers (2020-01-10T15:45:59Z)