Diacritics Restoration using BERT with Analysis on Czech language
- URL: http://arxiv.org/abs/2105.11408v1
- Date: Mon, 24 May 2021 16:58:27 GMT
- Title: Diacritics Restoration using BERT with Analysis on Czech language
- Authors: Jakub Náplava, Milan Straka, Jana Straková
- Abstract summary: We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT.
We conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization.
- Score: 3.2729625923640278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new architecture for diacritics restoration based on
contextualized embeddings, namely BERT, and we evaluate it on 12 languages with
diacritics. Furthermore, we conduct a detailed error analysis on Czech, a
morphologically rich language with a high level of diacritization. Notably, we
manually annotate all mispredictions, showing that roughly 44% of them are
actually not errors, but either plausible variants (19%), or the system
corrections of erroneous data (25%). Finally, we categorize the real errors in
detail. We release the code at
https://github.com/ufal/bert-diacritics-restoration.
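To make the task concrete: diacritics restoration maps an undiacritized string back to its diacritized form. The paper does this with classification on top of BERT's contextualized embeddings; the sketch below is only a hypothetical corpus-lookup baseline illustrating the input/output of the task, not the paper's architecture.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose characters (NFD) and drop combining marks, e.g. "město" -> "mesto".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_lexicon(corpus: list[str]) -> dict[str, str]:
    # Toy restorer: remember the first diacritized form seen for each stripped word.
    # A contextual model (like the paper's BERT-based one) would instead predict
    # per-character diacritics from the surrounding sentence.
    lexicon: dict[str, str] = {}
    for sentence in corpus:
        for word in sentence.split():
            lexicon.setdefault(strip_diacritics(word), word)
    return lexicon

def restore(text: str, lexicon: dict[str, str]) -> str:
    # Replace each word with its diacritized variant if one is known.
    return " ".join(lexicon.get(word, word) for word in text.split())
```

The lookup baseline fails exactly where context matters (one stripped form with several valid diacritized variants), which is what motivates contextualized embeddings.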
Related papers
- Assessing the Efficacy of Grammar Error Correction: A Human Evaluation Approach in the Japanese Context [10.047123247001714]
We evaluate the performance of the state-of-the-art sequence tagging grammar error detection and correction model (SeqTagger).
Using an automatic annotation toolkit, ERRANT, we first evaluate SeqTagger's performance on error correction with human expert correction as the benchmark.
Results indicate a precision of 63.66% and a recall of 20.19% for error correction on the full dataset.
arXiv Detail & Related papers (2024-02-28T06:43:43Z)
- Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning [90.13978453378768]
We introduce a comprehensive typology of factual errors in generated chart captions.
A large-scale human annotation effort provides insight into the error patterns and frequencies in captions crafted by various chart captioning models.
Our analysis reveals that even state-of-the-art models, including GPT-4V, frequently produce captions laced with factual inaccuracies.
arXiv Detail & Related papers (2023-12-15T19:16:21Z)
- GEE! Grammar Error Explanation with Large Language Models [64.16199533560017]
We propose the task of grammar error explanation, where a system needs to provide one-sentence explanations for each grammatical error in a pair of erroneous and corrected sentences.
We analyze the capability of GPT-4 in grammar error explanation, and find that it only produces explanations for 60.2% of the errors using one-shot prompting.
We develop a two-step pipeline that leverages fine-tuned and prompted large language models to perform structured atomic token edit extraction.
arXiv Detail & Related papers (2023-11-16T02:45:47Z)
- Toward Human-Like Evaluation for Natural Language Generation with Error Analysis [93.34894810865364]
Recent studies show that considering both major errors (e.g. mistranslated tokens) and minor errors can produce high-quality human judgments.
This inspires us to approach the final goal of the evaluation metrics (human-like evaluations) by automatic error analysis.
We augment BARTScore by incorporating the human-like error analysis strategies, namely BARTScore++, where the final score consists of both the evaluations of major errors and minor errors.
arXiv Detail & Related papers (2022-12-20T11:36:22Z)
- A transformer-based spelling error correction framework for Bangla and resource scarce Indic languages [2.5874041837241304]
Spelling error correction is the task of identifying and rectifying misspelled words in texts.
Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods.
We propose a novel detector-purificator-corrector framework (DPC) based on denoising transformers, addressing the shortcomings of these earlier approaches.
arXiv Detail & Related papers (2022-11-07T17:59:05Z)
- Is Word Error Rate a good evaluation metric for Speech Recognition in Indic Languages? [0.0]
We propose a new method for the calculation of error rates in Automatic Speech Recognition (ASR).
This new metric is for languages that contain half characters and where the same character can be written in different forms.
We implement our methodology for Hindi, one of the main languages in the Indic context.
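The half-character issue above arises partly because the same visible character can be encoded by different Unicode codepoint sequences, so a naive word error rate over-penalizes. The sketch below is an illustrative baseline (not the paper's exact metric): a standard Levenshtein-based WER that normalizes both strings to NFC first, so encoding-only differences compare as equal.

```python
import unicodedata

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    # Classic dynamic-programming Levenshtein distance over token lists.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    # NFC normalization collapses precomposed vs. decomposed codepoint
    # sequences before comparison; a metric tailored to half characters
    # would need further, script-specific handling.
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

Normalization alone does not resolve cases where a half character is a plausible alternate spelling of a full form, which is the gap the paper's proposed metric targets.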
arXiv Detail & Related papers (2022-03-30T18:32:08Z)
- Correcting diacritics and typos with ByT5 transformer model [0.0]
People tend to forgo using diacritics and make typographical errors (typos) when typing.
In this work, we tackle both problems at once by employing newly-developed ByT5 byte-level transformer models.
Our simultaneous diacritics restoration and typos correction approach demonstrates near state-of-the-art performance in 13 languages.
arXiv Detail & Related papers (2022-01-31T13:52:51Z)
- Czech Grammar Error Correction with a Large and Diverse Corpus [64.94696028072698]
We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC).
The Grammar Error Correction Corpus for Czech (GECCC) offers four domains, covering error distributions ranging from high-error-density essays written by non-native speakers to website texts.
We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline for future research.
arXiv Detail & Related papers (2022-01-14T18:20:47Z)
- Scarecrow: A Framework for Scrutinizing Machine Text [69.26985439191151]
We introduce a new structured, crowdsourced error annotation schema called Scarecrow.
Scarecrow collects 13k annotations of 1.3k human- and machine-generated paragraphs of English-language news text.
These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems.
arXiv Detail & Related papers (2021-07-02T22:37:03Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.