Related papers: Hierarchical Character Tagger for Short Text Spelling Error Correction

Hierarchical Character Tagger for Short Text Spelling Error Correction

URL: http://arxiv.org/abs/2109.14259v1
Date: Wed, 29 Sep 2021 08:04:34 GMT
Title: Hierarchical Character Tagger for Short Text Spelling Error Correction
Authors: Mengyi Gao, Canran Xu, Peng Shi
Abstract summary: We present a Hierarchical Character Tagger model, or HCTagger, for short text spelling error correction. We use a pre-trained language model at the character level as a text encoder, and then predict character-level edits to transform the original text into its error-free form with a much smaller label space. Experiments on two public misspelling correction datasets demonstrate that HCTagger is an accurate and much faster approach than many existing models.
Score: 27.187562419222218
License: http://creativecommons.org/licenses/by/4.0/
Abstract: State-of-the-art approaches to spelling error correction problem include Transformer-based Seq2Seq models, which require large training sets and suffer from slow inference time; and sequence labeling models based on Transformer encoders like BERT, which involve token-level label space and therefore a large pre-defined vocabulary dictionary. In this paper we present a Hierarchical Character Tagger model, or HCTagger, for short text spelling error correction. We use a pre-trained language model at the character level as a text encoder, and then predict character-level edits to transform the original text into its error-free form with a much smaller label space. For decoding, we propose a hierarchical multi-task approach to alleviate the issue of long-tail label distribution without introducing extra model parameters. Experiments on two public misspelling correction datasets demonstrate that HCTagger is an accurate and much faster approach than many existing models.

Related papers

Tgea: An error-annotated dataset and benchmark tasks for text generation from pretrained language models [57.758735361535486]
TGEA is an error-annotated dataset for text generation from pretrained language models (PLMs) We create an error taxonomy to cover 24 types of errors occurring in PLM-generated sentences. This is the first dataset with comprehensive annotations for PLM-generated texts.
arXiv Detail & Related papers (2025-03-06T09:14:02Z)
Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora [0.0]
Grammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and grammatical issues in text. We show that a byte-level model enables higher correction quality than a subword approach.
arXiv Detail & Related papers (2023-05-29T06:35:40Z)
Towards Fine-Grained Information: Identifying the Type and Location of Translation Errors [80.22825549235556]
Existing approaches can not synchronously consider error position and type. We build an FG-TED model to predict the textbf addition and textbfomission errors. Experiments show that our model can identify both error type and position concurrently, and gives state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z)
Thutmose Tagger: Single-pass neural model for Inverse Text Normalization [76.87664008338317]
Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition. We present a dataset preparation method based on the granular alignment of ITN examples. One-to-one correspondence between tags and input words improves the interpretability of the model's predictions.
arXiv Detail & Related papers (2022-07-29T20:39:02Z)
Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model [12.53710938104476]
We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling.
arXiv Detail & Related papers (2022-02-16T16:21:53Z)
Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model. We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder. We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
Scarecrow: A Framework for Scrutinizing Machine Text [69.26985439191151]
We introduce a new structured, crowdsourced error annotation schema called Scarecrow. Scarecrow collects 13k annotations of 1.3k human and machine generate paragraphs of English language news text. These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems.
arXiv Detail & Related papers (2021-07-02T22:37:03Z)
Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction [49.25830718574892]
We present a new framework named Tail-to-Tail (textbfTtT) non-autoregressive sequence prediction. Considering that most tokens are correct and can be conveyed directly from source to target, and the error positions can be estimated and corrected. Experimental results on standard datasets, especially on the variable-length datasets, demonstrate the effectiveness of TtT in terms of sentence-level Accuracy, Precision, Recall, and F1-Measure.
arXiv Detail & Related papers (2021-06-03T05:56:57Z)
Grammatical Error Correction as GAN-like Sequence Labeling [45.19453732703053]
We propose a GAN-like sequence labeling model, which consists of a grammatical error detector as a discriminator and a grammatical error labeler with Gumbel-Softmax sampling as a generator. Our results on several evaluation benchmarks demonstrate that our proposed approach is effective and improves the previous state-of-the-art baseline.
arXiv Detail & Related papers (2021-05-29T04:39:40Z)
Towards Minimal Supervision BERT-based Grammar Error Correction [81.90356787324481]
We try to incorporate contextual information from pre-trained language model to leverage annotation and benefit multilingual scenarios. Results show strong potential of Bidirectional Representations from Transformers (BERT) in grammatical error correction task.
arXiv Detail & Related papers (2020-01-10T15:45:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.