Autocorrect for Estonian texts: final report from project EKTB25
- URL: http://arxiv.org/abs/2402.11671v1
- Date: Sun, 18 Feb 2024 18:20:57 GMT
- Title: Autocorrect for Estonian texts: final report from project EKTB25
- Authors: Agnes Luhtaru, Martin Vainikko, Krista Liin, Kais Allkivi-Metsoja,
Jaagup Kippar, Pille Eslon, Mark Fishel
- Abstract summary: The project was funded in 2021-2023 by the National Programme of Estonian Language Technology.
Its main aim was to develop spelling and grammar correction tools for the Estonian language.
There has been a breakthrough in large language models: GPT4, a commercial language model with Estonian-language support, has been created.
- Score: 0.6597195879147557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The project was funded in 2021-2023 by the National Programme of Estonian
Language Technology. Its main aim was to develop spelling and grammar
correction tools for the Estonian language. The main challenge was the very
small amount of available error correction data needed for such development. To
mitigate this, (1) we annotated more correction data for model training and
testing, (2) we tested transfer-learning, i.e. retraining machine learning
models created for other tasks, so as not to depend solely on correction data,
(3) we compared the developed method and model with alternatives, including
large language models. We also developed automatic evaluation, which can
calculate the accuracy and yield of corrections by error category, so that the
effectiveness of different methods can be compared in detail.
There has been a breakthrough in large language models during the project:
GPT4, a commercial language model with Estonian-language support, has been
created. We took the existence of the model into account when adjusting plans,
and in the report we present a comparison with GPT4's ability to improve
Estonian-language text.
The final results show that the approach we have developed provides better
scores than GPT4 and the result is usable but not entirely reliable yet. The
report also contains ideas on how GPT4 and other major language models can be
implemented in the future, focusing on open-source solutions.
All results of this project are open-data/open-source, with licenses that
allow them to be used for purposes including commercial ones.
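The per-category evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "accuracy" and "yield" correspond roughly to precision and recall over proposed edits. It also assumes that an edit is a hypothetical tuple (sentence_id, start, end, replacement, category) and that a system edit counts as correct only on an exact match with a gold edit. None of these names or conventions come from the project's actual evaluation tool.

```python
from collections import defaultdict


def per_category_scores(gold_edits, system_edits):
    """Return {category: {"precision": p, "recall": r}}.

    Each edit is a hypothetical tuple:
    (sentence_id, start, end, replacement, category).
    Exact-match scoring is a simplifying assumption.
    """
    gold = set(gold_edits)
    system = set(system_edits)
    tp = defaultdict(int)  # system edits that match a gold edit
    fp = defaultdict(int)  # system edits with no matching gold edit
    fn = defaultdict(int)  # gold edits the system missed

    for edit in system:
        category = edit[-1]
        if edit in gold:
            tp[category] += 1
        else:
            fp[category] += 1
    for edit in gold:
        if edit not in system:
            fn[edit[-1]] += 1

    scores = {}
    for category in set(tp) | set(fp) | set(fn):
        proposed = tp[category] + fp[category]
        expected = tp[category] + fn[category]
        scores[category] = {
            "precision": tp[category] / proposed if proposed else 0.0,
            "recall": tp[category] / expected if expected else 0.0,
        }
    return scores


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    gold = [(0, 1, 2, "A", "spelling"), (0, 4, 5, "B", "case"),
            (1, 0, 1, "C", "spelling")]
    system = [(0, 1, 2, "A", "spelling"), (0, 4, 5, "B", "case"),
              (1, 0, 1, "X", "spelling")]
    for category, result in sorted(per_category_scores(gold, system).items()):
        print(category, result)
```

Reporting the two numbers per category, rather than one overall score, is what makes it possible to see that a method is reliable for one error type but weak for another.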
Related papers
- ArbESC+: Arabic Enhanced Edit Selection System Combination for Grammatical Error Correction Resolving conflict and improving system combination in Arabic GEC [0.8643249539674613]
We present one of the first multi-system approaches for correcting grammatical errors in Arabic.
A combination of AraT5, ByT5, mT5, AraBART, AraBART+Morph+GEC, and Text editing systems gave better results than a single model alone.
arXiv Detail & Related papers (2025-11-18T08:06:28Z) - Aligning Knowledge Graphs and Language Models for Factual Accuracy [7.205708660952737]
We introduce ALIGNed-LLM, a simple yet effective approach to improve language models' factuality.
We use embeddings from a pre-trained Knowledge Graph Embedding (KGE) model, such as TransE, and a trainable projection layer to align entity and text embeddings.
arXiv Detail & Related papers (2025-07-17T08:15:50Z) - KD-MSLRT: Lightweight Sign Language Recognition Model Based on Mediapipe and 3D to 1D Knowledge Distillation [8.891724904033582]
We propose a cross-modal multi-knowledge distillation technique from 3D to 1D and a novel end-to-end pre-training text correction framework.
Our model achieves a decrease in Word Error Rate (WER) of at least 1.4% on PHOENIX14 and PHOENIX14T datasets compared to the state-of-the-art CorrNet.
We have also collected and released extensive Chinese sign language datasets, and developed a specialized training vocabulary.
arXiv Detail & Related papers (2025-01-04T15:59:33Z) - NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z) - Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems [47.753284211200665]
We focus on understanding the usefulness of incorporating "error-correction" data directly into the pretraining stage.
This data consists of erroneous solution steps immediately followed by their corrections.
We show promising results: this type of pretrain data can help language models achieve higher reasoning accuracy.
arXiv Detail & Related papers (2024-08-29T06:49:20Z) - A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back.
Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair.
This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translation of code from one programming language to another programming or natural language, and back.
arXiv Detail & Related papers (2024-01-15T22:36:31Z) - Program-Aided Reasoners (better) Know What They Know [59.29201607431494]
We compare the calibration of Program Aided Language Models (PAL) and text-based Chain-of-thought (COT) prompting techniques over 5 datasets.
Our results indicate that PAL leads to improved calibration in 75% of the instances.
arXiv Detail & Related papers (2023-11-16T04:17:49Z) - Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now widely used, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z) - Does Correction Remain A Problem For Large Language Models? [63.24433996856764]
This paper investigates the role of correction in the context of large language models by conducting two experiments.
The first experiment focuses on correction as a standalone task, employing few-shot learning techniques with GPT-like models for error correction.
The second experiment explores the notion of correction as a preparatory task for other NLP tasks, examining whether large language models can tolerate and perform adequately on texts containing certain levels of noise or errors.
arXiv Detail & Related papers (2023-08-03T14:09:31Z) - Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages [0.0]
We train a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian.
We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy.
arXiv Detail & Related papers (2021-12-20T14:26:40Z) - From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance the model performance.
The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
arXiv Detail & Related papers (2021-12-09T07:31:15Z) - CoreLM: Coreference-aware Language Model Fine-Tuning [0.0]
We propose a Fine-Tuning framework, named CoreLM, that extends the architecture of current Pretrained Language Models.
We make available information outside the contextual space of the model, which results in a better Language Model for a fraction of the computational cost.
Our proposed model achieves a lower Perplexity on the GUMBY and LAMBADA datasets when compared to GPT2 and a fine-tuned version of GPT2 without any changes.
arXiv Detail & Related papers (2021-11-04T08:44:31Z) - Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead? [2.62121275102348]
We show that machine translation is a mature technology, which raises a serious counter-argument for training native language models for low-resource languages.
As English language models are improving at an unprecedented pace, which in turn improves machine translation, it is from an empirical and environmental standpoint more effective to translate data from low-resource languages into English.
arXiv Detail & Related papers (2021-04-21T10:21:24Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.