Look Ma, Only 400 Samples! Revisiting the Effectiveness of Automatic
N-Gram Rule Generation for Spelling Normalization in Filipino
- URL: http://arxiv.org/abs/2210.02675v1
- Date: Thu, 6 Oct 2022 04:41:26 GMT
- Title: Look Ma, Only 400 Samples! Revisiting the Effectiveness of Automatic
N-Gram Rule Generation for Spelling Normalization in Filipino
- Authors: Lorenzo Jaime Yu Flores
- Abstract summary: With 84.75 million Filipinos online, the ability of models to process online text is crucial for developing Filipino NLP applications.
We propose an N-Gram + Damerau-Levenshtein distance model with automatic rule extraction.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With 84.75 million Filipinos online, the ability of models to process
online text is crucial for developing Filipino NLP applications. To this end,
spelling correction is an essential preprocessing step for downstream tasks.
However, the lack of data prevents the use of language models for this task. In
this paper, we propose an N-Gram + Damerau-Levenshtein distance model with
automatic rule extraction. We train the model on 300 samples and show that,
despite the limited training data, it achieves good performance and outperforms
other deep learning approaches in terms of accuracy and edit distance. Moreover,
the model (1) requires little compute power, (2) trains in little time, thus
allowing for retraining, and (3) is easily interpretable, allowing for direct
troubleshooting. This highlights the success of traditional approaches over more
complex deep learning models in settings where data is unavailable.
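As a rough illustration of the approach described in the abstract, the sketch below mines character n-gram rewrite rules from (misspelled, standard) word pairs and ranks normalization candidates with Damerau-Levenshtein distance. The rule format, candidate generation, toy lexicon, and scoring are assumptions for illustration, not the paper's exact algorithm.

```python
# Illustrative sketch only: mine character n-gram rewrite rules from
# (misspelled, standard) word pairs and rank normalization candidates with
# Damerau-Levenshtein distance. The rule format, candidate generation, toy
# lexicon, and scoring are assumptions, not the paper's exact algorithm.
from collections import Counter
from itertools import product


def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def extract_rules(pairs, n=2):
    """Count n-gram rewrites (one edit apart) seen in (noisy, standard) pairs."""
    rules = Counter()
    for noisy, std in pairs:
        for i in range(len(noisy) - n + 1):
            for j in range(len(std) - n + 1):
                src, tgt = noisy[i:i + n], std[j:j + n]
                if src != tgt and damerau_levenshtein(src, tgt) == 1:
                    rules[(src, tgt)] += 1
    return rules


def normalize(word, rules, lexicon):
    """Apply mined rules to generate candidates, then return the lexicon
    entry closest (by Damerau-Levenshtein distance) to any candidate."""
    candidates = {word}
    for (src, tgt), _count in rules.most_common():
        candidates.update(c.replace(src, tgt) for c in list(candidates) if src in c)
    best = min(product(lexicon, candidates),
               key=lambda pair: damerau_levenshtein(pair[0], pair[1]))
    return best[0]


# Toy usage with hypothetical shortcut-spelling pairs and a tiny lexicon.
pairs = [("dto", "dito"), ("aq", "ako"), ("cya", "siya")]
rules = extract_rules(pairs, n=2)
print(normalize("dto", rules, lexicon={"dito", "ako", "siya"}))  # -> dito
```

Because both rule extraction and ranking are simple table and distance computations, a model of this kind retrains in seconds on a few hundred samples, which is consistent with the compute and interpretability claims above.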
Related papers
- KD-MSLRT: Lightweight Sign Language Recognition Model Based on Mediapipe and 3D to 1D Knowledge Distillation [8.891724904033582]
We propose a cross-modal multi-knowledge distillation technique from 3D to 1D and a novel end-to-end pre-training text correction framework.
Our model achieves a decrease in Word Error Rate (WER) of at least 1.4% on PHOENIX14 and PHOENIX14T datasets compared to the state-of-the-art CorrNet.
We have also collected and released extensive Chinese sign language datasets, and developed a specialized training vocabulary.
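KD-MSLRT's cross-modal 3D-to-1D formulation is not detailed in this summary, but the knowledge-distillation objective it builds on can be sketched generically as follows; the temperature, loss weighting, and tensor shapes are assumptions, not the paper's settings.

```python
# Generic knowledge-distillation loss, a sketch of the building block named
# above rather than KD-MSLRT's exact cross-modal 3D-to-1D formulation.
# Temperature, loss weighting, and tensor shapes are assumptions.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=4.0, alpha=0.5):
    # Soft targets: the student mimics the teacher's softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard


# Toy usage: a lightweight student distilled from a larger teacher.
student_logits = torch.randn(8, 100)            # (batch, num_classes)
teacher_logits = torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))
print(distillation_loss(student_logits, teacher_logits, targets))
```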
arXiv Detail & Related papers (2025-01-04T15:59:33Z)
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
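A rough sketch of loss-tracked data prioritization in this spirit is shown below; the scoring rule, revisit fraction, and block representation are assumptions rather than LFR's published procedure.

```python
# Rough sketch of loss-tracked data prioritization in the spirit of LFR;
# the scoring rule, revisit fraction, and block representation are
# assumptions, not the paper's published procedure.
import random


def lfr_style_epoch(step, blocks, loss_history, revisit_frac=0.25):
    """step(block) performs a training step and returns the block's loss."""
    # Learn: one pass over all data blocks, recording per-block loss.
    for i, block in enumerate(blocks):
        loss_history[i] = step(block)
    # Focus/Review: revisit the highest-loss (most challenging) blocks.
    hardest = sorted(loss_history, key=loss_history.get, reverse=True)
    revisit = hardest[: max(1, int(revisit_frac * len(blocks)))]
    for i in revisit:
        loss_history[i] = step(blocks[i])
    return revisit


# Toy usage with a stand-in training step that returns a random loss.
blocks = [f"block-{i}" for i in range(20)]
loss_history = {}
print(lfr_style_epoch(lambda block: random.random(), blocks, loss_history))
```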
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes.
Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts.
We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z)
- Back to Patterns: Efficient Japanese Morphological Analysis with Feature-Sequence Trie [9.49725486620342]
This study revisits the fastest pattern-based NLP methods to make them as accurate as possible.
The proposed method induces reliable patterns from a morphological dictionary and annotated data.
Experimental results on two standard datasets confirm that the method exhibits comparable accuracy to learning-based baselines.
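The longest-match lookup that a feature-sequence trie enables can be illustrated with a toy trie over character keys; the keys, payloads, and greedy segmentation loop below are assumptions about the general mechanism, not the paper's implementation.

```python
# Toy trie for longest-match pattern lookup, illustrating the general
# mechanism; the character keys, payloads, and greedy segmentation loop
# are assumptions, not the paper's implementation.
class Trie:
    def __init__(self):
        self.children = {}
        self.payload = None  # e.g., a tag or analysis decision

    def insert(self, key, payload):
        node = self
        for symbol in key:
            node = node.children.setdefault(symbol, Trie())
        node.payload = payload

    def longest_match(self, sequence, start=0):
        """Return (end_index, payload) of the longest pattern starting at start."""
        node, best = self, (start, None)
        for i in range(start, len(sequence)):
            node = node.children.get(sequence[i])
            if node is None:
                break
            if node.payload is not None:
                best = (i + 1, node.payload)
        return best


# Toy usage: greedy segmentation of a string against a tiny pattern set.
trie = Trie()
for pattern, tag in [("ab", "NOUN"), ("abc", "VERB"), ("d", "PART")]:
    trie.insert(pattern, tag)
pos, text = 0, "abcd"
while pos < len(text):
    end, tag = trie.longest_match(text, pos)
    if tag is None:              # unknown symbol: emit it and move on
        end, tag = pos + 1, "UNK"
    print(text[pos:end], tag)    # -> "abc VERB", then "d PART"
    pos = end
```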
arXiv Detail & Related papers (2023-05-30T14:00:30Z)
- LIMA: Less Is More for Alignment [112.93890201395477]
We train LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses.
LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples.
In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases.
arXiv Detail & Related papers (2023-05-18T17:45:22Z)
- Privacy Adhering Machine Un-learning in NLP [66.17039929803933]
Real-world industry applications use machine learning to build models on user data.
Privacy regulations mandate that such data can be removed on request, which requires effort both in data deletion and in model retraining.
Continuously removing data and retraining the model does not scale.
We propose Machine Unlearning to tackle this challenge.
arXiv Detail & Related papers (2022-12-19T16:06:45Z)
- What Stops Learning-based 3D Registration from Working in the Real World? [53.68326201131434]
This work identifies the sources of 3D point cloud registration failures, analyzes the reasons behind them, and proposes solutions.
Ultimately, this translates to a best-practice 3D registration network (BPNet), constituting the first learning-based method able to handle previously-unseen objects in real-world data.
Our model generalizes to real data without any fine-tuning, reaching an accuracy of up to 67% on point clouds of unseen objects obtained with a commercial sensor.
arXiv Detail & Related papers (2021-11-19T19:24:27Z)
- Complementary Ensemble Learning [1.90365714903665]
We derive a technique to improve the performance of state-of-the-art deep learning models.
Specifically, we train auxiliary models that can complement the uncertainty of the state-of-the-art model.
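One simple reading of "complementing the state-of-the-art model's uncertainty" is to defer to an auxiliary model on inputs where the primary model is unsure; the confidence threshold and deferral rule below are assumptions, not the paper's specific technique.

```python
# A simple uncertainty-based deferral ensemble: fall back to an auxiliary
# model where the primary model's confidence is low. The threshold and
# deferral rule are assumptions, not the paper's specific technique.
import numpy as np


def complementary_predict(primary_probs, auxiliary_probs, threshold=0.6):
    """Both inputs are (batch, num_classes) arrays of class probabilities."""
    primary_conf = primary_probs.max(axis=1)
    use_aux = primary_conf < threshold                  # low-confidence rows
    combined = np.where(use_aux[:, None], auxiliary_probs, primary_probs)
    return combined.argmax(axis=1), use_aux


# Toy usage: the second example is uncertain, so the auxiliary model decides.
primary = np.array([[0.9, 0.1], [0.55, 0.45]])
auxiliary = np.array([[0.4, 0.6], [0.2, 0.8]])
preds, deferred = complementary_predict(primary, auxiliary)
print(preds, deferred)  # -> [0 1] [False  True]
```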
arXiv Detail & Related papers (2021-11-09T03:23:05Z)
- LiST: Lite Self-training Makes Efficient Few-shot Learners [91.28065455714018]
LiST improves by 35% over classic fine-tuning methods and by 6% over prompt-tuning, with a 96% reduction in the number of trainable parameters, when fine-tuned with no more than 30 labeled examples from each target domain.
arXiv Detail & Related papers (2021-10-12T18:47:18Z)
- ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [25.430130072811075]
We propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models.
It fuses an auto-regressive network and an auto-encoding network, so that the trained model can be easily tailored to both natural language understanding and generation tasks.
We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.
arXiv Detail & Related papers (2021-07-05T16:54:59Z)
- Aligning the Pretraining and Finetuning Objectives of Language Models [1.0965065178451106]
We show that aligning the pretraining objectives to the finetuning objectives in language model training significantly improves the finetuning task performance.
We refer to finetuning small language models with at most a few hundred training examples as "Few Example Learning".
In practice, Few Example Learning enabled by objective alignment not only saves human labeling costs, but also makes it possible to leverage language models in more real-time applications.
arXiv Detail & Related papers (2020-02-05T21:40:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.