Look Ma, Only 400 Samples! Revisiting the Effectiveness of Automatic
N-Gram Rule Generation for Spelling Normalization in Filipino
- URL: http://arxiv.org/abs/2210.02675v1
- Date: Thu, 6 Oct 2022 04:41:26 GMT
- Title: Look Ma, Only 400 Samples! Revisiting the Effectiveness of Automatic
N-Gram Rule Generation for Spelling Normalization in Filipino
- Authors: Lorenzo Jaime Yu Flores
- Abstract summary: With 84.75 million Filipinos online, the ability of models to process online text is crucial for developing Filipino NLP applications.
We propose an N-Gram + Damerau-Levenshtein distance model with automatic rule extraction.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With 84.75 million Filipinos online, the ability of models to process
online text is crucial for developing Filipino NLP applications. To this end,
spelling correction is an essential preprocessing step for downstream tasks.
However, the lack of data prevents the use of language models for this task. In
this paper, we propose an N-Gram + Damerau-Levenshtein distance model with
automatic rule extraction. We train the model on 300 samples and show that,
despite the limited training data, it achieves good performance and outperforms
other deep learning approaches in terms of accuracy and edit distance. Moreover,
the model (1) requires little compute power, (2) trains in little time, thus
allowing for retraining, and (3) is easily interpretable, allowing for direct
troubleshooting. This highlights the success of traditional approaches over more
complex deep learning models in settings where data is unavailable.
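As a rough illustration of the approach described in the abstract, the sketch below mines character n-gram rewrite rules from (misspelled, standard) word pairs and ranks normalization candidates with Damerau-Levenshtein distance. The rule format, candidate generation, toy lexicon, and scoring are assumptions for illustration, not the paper's exact algorithm.

```python
# Illustrative sketch only: mine character n-gram rewrite rules from
# (misspelled, standard) word pairs and rank normalization candidates with
# Damerau-Levenshtein distance. The rule format, candidate generation, toy
# lexicon, and scoring are assumptions, not the paper's exact algorithm.
from collections import Counter
from itertools import product


def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def extract_rules(pairs, n=2):
    """Count n-gram rewrites (one edit apart) seen in (noisy, standard) pairs."""
    rules = Counter()
    for noisy, std in pairs:
        for i in range(len(noisy) - n + 1):
            for j in range(len(std) - n + 1):
                src, tgt = noisy[i:i + n], std[j:j + n]
                if src != tgt and damerau_levenshtein(src, tgt) == 1:
                    rules[(src, tgt)] += 1
    return rules


def normalize(word, rules, lexicon):
    """Apply mined rules to generate candidates, then return the lexicon
    entry closest (by Damerau-Levenshtein distance) to any candidate."""
    candidates = {word}
    for (src, tgt), _count in rules.most_common():
        candidates.update(c.replace(src, tgt) for c in list(candidates) if src in c)
    best = min(product(lexicon, candidates),
               key=lambda pair: damerau_levenshtein(pair[0], pair[1]))
    return best[0]


# Toy usage with hypothetical shortcut-spelling pairs and a tiny lexicon.
pairs = [("dto", "dito"), ("aq", "ako"), ("cya", "siya")]
rules = extract_rules(pairs, n=2)
print(normalize("dto", rules, lexicon={"dito", "ako", "siya"}))  # -> dito
```

Because both rule extraction and ranking are simple table and distance computations, a model of this kind retrains in seconds on a few hundred samples, which is consistent with the compute and interpretability claims above.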
Related papers
- KD-MSLRT: Lightweight Sign Language Recognition Model Based on Mediapipe and 3D to 1D Knowledge Distillation [8.891724904033582]
We propose a cross-modal multi-knowledge distillation technique from 3D to 1D and a novel end-to-end pre-training text correction framework.
Our model achieves a decrease in Word Error Rate (WER) of at least 1.4% on PHOENIX14 and PHOENIX14T datasets compared to the state-of-the-art CorrNet.
We have also collected and released extensive Chinese sign language datasets, and developed a specialized training vocabulary.
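KD-MSLRT's cross-modal 3D-to-1D formulation is not detailed in this summary, but the knowledge-distillation objective it builds on can be sketched generically as follows; the temperature, loss weighting, and tensor shapes are assumptions, not the paper's settings.

```python
# Generic knowledge-distillation loss, a sketch of the building block named
# above rather than KD-MSLRT's exact cross-modal 3D-to-1D formulation.
# Temperature, loss weighting, and tensor shapes are assumptions.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=4.0, alpha=0.5):
    # Soft targets: the student mimics the teacher's softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard


# Toy usage: a lightweight student distilled from a larger teacher.
student_logits = torch.randn(8, 100)            # (batch, num_classes)
teacher_logits = torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))
print(distillation_loss(student_logits, teacher_logits, targets))
```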
arXiv Detail & Related papers (2025-01-04T15:59:33Z)
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
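A rough sketch of loss-tracked data prioritization in this spirit is shown below; the scoring rule, revisit fraction, and block representation are assumptions rather than LFR's published procedure.

```python
# Rough sketch of loss-tracked data prioritization in the spirit of LFR;
# the scoring rule, revisit fraction, and block representation are
# assumptions, not the paper's published procedure.
import random


def lfr_style_epoch(step, blocks, loss_history, revisit_frac=0.25):
    """step(block) performs a training step and returns the block's loss."""
    # Learn: one pass over all data blocks, recording per-block loss.
    for i, block in enumerate(blocks):
        loss_history[i] = step(block)
    # Focus/Review: revisit the highest-loss (most challenging) blocks.
    hardest = sorted(loss_history, key=loss_history.get, reverse=True)
    revisit = hardest[: max(1, int(revisit_frac * len(blocks)))]
    for i in revisit:
        loss_history[i] = step(blocks[i])
    return revisit


# Toy usage with a stand-in training step that returns a random loss.
blocks = [f"block-{i}" for i in range(20)]
loss_history = {}
print(lfr_style_epoch(lambda block: random.random(), blocks, loss_history))
```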
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes.
Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts.
We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z)
- Back to Patterns: Efficient Japanese Morphological Analysis with Feature-Sequence Trie [9.49725486620342]
This study revisits the fastest pattern-based NLP methods to make them as accurate as possible.
The proposed method induces reliable patterns from a morphological dictionary and annotated data.
Experimental results on two standard datasets confirm that the method exhibits comparable accuracy to learning-based baselines.
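The longest-match lookup that a feature-sequence trie enables can be illustrated with a toy trie over character keys; the keys, payloads, and greedy segmentation loop below are assumptions about the general mechanism, not the paper's implementation.

```python
# Toy trie for longest-match pattern lookup, illustrating the general
# mechanism; the character keys, payloads, and greedy segmentation loop
# are assumptions, not the paper's implementation.
class Trie:
    def __init__(self):
        self.children = {}
        self.payload = None  # e.g., a tag or analysis decision

    def insert(self, key, payload):
        node = self
        for symbol in key:
            node = node.children.setdefault(symbol, Trie())
        node.payload = payload

    def longest_match(self, sequence, start=0):
        """Return (end_index, payload) of the longest pattern starting at start."""
        node, best = self, (start, None)
        for i in range(start, len(sequence)):
            node = node.children.get(sequence[i])
            if node is None:
                break
            if node.payload is not None:
                best = (i + 1, node.payload)
        return best


# Toy usage: greedy segmentation of a string against a tiny pattern set.
trie = Trie()
for pattern, tag in [("ab", "NOUN"), ("abc", "VERB"), ("d", "PART")]:
    trie.insert(pattern, tag)
pos, text = 0, "abcd"
while pos < len(text):
    end, tag = trie.longest_match(text, pos)
    if tag is None:              # unknown symbol: emit it and move on
        end, tag = pos + 1, "UNK"
    print(text[pos:end], tag)    # -> "abc VERB", then "d PART"
    pos = end
```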
arXiv Detail & Related papers (2023-05-30T14:00:30Z)
- LIMA: Less Is More for Alignment [112.93890201395477]
We train LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses.
LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples.
In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases.
arXiv Detail & Related papers (2023-05-18T17:45:22Z)
- Privacy Adhering Machine Un-learning in NLP [66.17039929803933]
Real-world industry applications use machine learning to build models on user data.
Privacy regulations mandate that such data can be removed on request, which requires effort both in data deletion and in model retraining.
Continuously removing data and retraining the model does not scale.
We propose Machine Unlearning to tackle this challenge.
arXiv Detail & Related papers (2022-12-19T16:06:45Z)
- What Stops Learning-based 3D Registration from Working in the Real World? [53.68326201131434]
This work identifies the sources of 3D point cloud registration failures, analyzes the reasons behind them, and proposes solutions.
Ultimately, this translates to a best-practice 3D registration network (BPNet), constituting the first learning-based method able to handle previously-unseen objects in real-world data.
Our model generalizes to real data without any fine-tuning, reaching an accuracy of up to 67% on point clouds of unseen objects obtained with a commercial sensor.
arXiv Detail & Related papers (2021-11-19T19:24:27Z)
- Complementary Ensemble Learning [1.90365714903665]
We derive a technique to improve the performance of state-of-the-art deep learning models.
Specifically, we train auxiliary models that can complement the uncertainty of the state-of-the-art model.
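One simple reading of "complementing the state-of-the-art model's uncertainty" is to defer to an auxiliary model on inputs where the primary model is unsure; the confidence threshold and deferral rule below are assumptions, not the paper's specific technique.

```python
# A simple uncertainty-based deferral ensemble: fall back to an auxiliary
# model where the primary model's confidence is low. The threshold and
# deferral rule are assumptions, not the paper's specific technique.
import numpy as np


def complementary_predict(primary_probs, auxiliary_probs, threshold=0.6):
    """Both inputs are (batch, num_classes) arrays of class probabilities."""
    primary_conf = primary_probs.max(axis=1)
    use_aux = primary_conf < threshold                  # low-confidence rows
    combined = np.where(use_aux[:, None], auxiliary_probs, primary_probs)
    return combined.argmax(axis=1), use_aux


# Toy usage: the second example is uncertain, so the auxiliary model decides.
primary = np.array([[0.9, 0.1], [0.55, 0.45]])
auxiliary = np.array([[0.4, 0.6], [0.2, 0.8]])
preds, deferred = complementary_predict(primary, auxiliary)
print(preds, deferred)  # -> [0 1] [False  True]
```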
arXiv Detail & Related papers (2021-11-09T03:23:05Z)
- LiST: Lite Self-training Makes Efficient Few-shot Learners [91.28065455714018]
LiST improves by 35% over classic fine-tuning methods and by 6% over prompt-tuning, with a 96% reduction in the number of trainable parameters, when fine-tuned with no more than 30 labeled examples from each target domain.
arXiv Detail & Related papers (2021-10-12T18:47:18Z)
- ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [25.430130072811075]
We propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models.
It fuses an auto-regressive network and an auto-encoding network, so that the trained model can be easily tailored to both natural language understanding and generation tasks.
We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.
arXiv Detail & Related papers (2021-07-05T16:54:59Z)
- Aligning the Pretraining and Finetuning Objectives of Language Models [1.0965065178451106]
We show that aligning the pretraining objectives to the finetuning objectives in language model training significantly improves the finetuning task performance.
We refer to finetuning small language models with at most a few hundred training examples as "Few Example Learning".
In practice, Few Example Learning enabled by objective alignment not only saves human labeling costs, but also makes it possible to leverage language models in more real-time applications.
arXiv Detail & Related papers (2020-02-05T21:40:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.