Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model
- URL: http://arxiv.org/abs/2202.08171v1
- Date: Wed, 16 Feb 2022 16:21:53 GMT
- Title: Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model
- Authors: Hao Zhang and You-Chi Cheng and Shankar Kumar and W. Ronny Huang and Mingqing Chen and Rajiv Mathews
- Abstract summary: We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model.
We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling.
- Score: 12.53710938104476
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Capitalization normalization (truecasing) is the task of restoring the
correct case (uppercase or lowercase) of noisy text. We propose a fast,
accurate and compact two-level hierarchical word-and-character-based recurrent
neural network model. We use the truecaser to normalize user-generated text in
a Federated Learning framework for language modeling. A case-aware language
model trained on this normalized text achieves the same perplexity as a model
trained on text with gold capitalization. In a real user A/B experiment, we
demonstrate that the improvement translates to reduced prediction error rates
in a virtual keyboard application. Similarly, in an ASR language model fusion
experiment, we show reduction in uppercase character error rate and word error
rate.
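To make the two-level architecture concrete, the sketch below pairs a word-level bidirectional RNN, which supplies sentence context for each token, with a character-level RNN that labels every character of a lowercased word as lowercase or uppercase. This is a minimal illustration under assumed choices (PyTorch, GRU layers, the HierarchicalTruecaser name, and all dimensions are ours), not the authors' implementation.

```python
# Minimal sketch of a two-level hierarchical word-and-character truecaser.
# PyTorch, GRU layers, and all names/dimensions here are illustrative
# assumptions; the paper's exact architecture and training setup may differ.
import torch
import torch.nn as nn


class HierarchicalTruecaser(nn.Module):
    def __init__(self, char_vocab_size, char_dim=32, word_dim=64, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
        # Character encoder: summarizes each lowercased word from its characters.
        self.char_encoder = nn.GRU(char_dim, word_dim, batch_first=True)
        # Word-level RNN: adds sentence-wide context to every word summary.
        self.word_rnn = nn.GRU(word_dim, hidden, batch_first=True, bidirectional=True)
        # Character decoder: tags each character as lowercase (0) or uppercase (1),
        # conditioned on its embedding plus the word's sentence context.
        self.char_decoder = nn.GRU(char_dim + 2 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, char_ids):
        # char_ids: (num_words, max_word_len) integers for one lowercased sentence.
        chars = self.char_emb(char_ids)                            # (W, L, char_dim)
        _, word_vec = self.char_encoder(chars)                     # (1, W, word_dim)
        word_vec = word_vec.squeeze(0)                             # (W, word_dim)
        word_ctx, _ = self.word_rnn(word_vec.unsqueeze(0))         # (1, W, 2*hidden)
        word_ctx = word_ctx.squeeze(0)                             # (W, 2*hidden)
        ctx = word_ctx.unsqueeze(1).expand(-1, chars.size(1), -1)  # (W, L, 2*hidden)
        dec_out, _ = self.char_decoder(torch.cat([chars, ctx], dim=-1))
        return self.classifier(dec_out)                            # (W, L, 2) case logits


# Toy usage: a sentence of 3 words, up to 5 characters each, 30-character vocabulary.
model = HierarchicalTruecaser(char_vocab_size=30)
logits = model(torch.randint(1, 30, (3, 5)))
print(logits.shape)  # torch.Size([3, 5, 2])
```

In the federated language-modeling pipeline described in the abstract, a compact model of this kind would restore case in user-generated text before the case-aware language model is trained on it.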
Related papers
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large language models (LLMs) have become widely used, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z) - Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora [0.0]
Grammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and grammatical issues in text.
We show that a byte-level model enables higher correction quality than a subword approach.
arXiv Detail & Related papers (2023-05-29T06:35:40Z) - Hierarchical Phrase-based Sequence-to-Sequence Learning [94.10257313923478]
We describe a neural transducer that maintains the flexibility of standard sequence-to-sequence (seq2seq) models while incorporating hierarchical phrases as a source of inductive bias during training and as explicit constraints during inference.
Our approach trains two models: a discriminative parser based on a bracketing grammar whose derivation tree hierarchically aligns source and target phrases, and a neural seq2seq model that learns to translate the aligned phrases one-by-one.
arXiv Detail & Related papers (2022-11-15T05:22:40Z) - Thutmose Tagger: Single-pass neural model for Inverse Text Normalization [76.87664008338317]
Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition.
We present a dataset preparation method based on the granular alignment of ITN examples.
One-to-one correspondence between tags and input words improves the interpretability of the model's predictions.
arXiv Detail & Related papers (2022-07-29T20:39:02Z) - Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z) - Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered here are self-normalized, so no additional correction step is needed.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z) - Position-Invariant Truecasing with a Word-and-Character Hierarchical Recurrent Neural Network [10.425277173548212]
We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model.
We also address the problem of truecasing while ignoring token positions in the sentence.
arXiv Detail & Related papers (2021-08-26T17:54:35Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)