The Impact of Data Corruption on Named Entity Recognition for Low-resourced Languages
- URL: http://arxiv.org/abs/2208.04568v2
- Date: Sat, 25 Nov 2023 19:52:13 GMT
- Title: The Impact of Data Corruption on Named Entity Recognition for Low-resourced Languages
- Authors: Manuel Fokam, Michael Beukman
- Abstract summary: Data availability and quality are major challenges in natural language processing for low-resourced languages.
We measure the effect of data quantity and quality on the performance of pre-trained language models in a low-resourced setting.
- Score: 0.10641561702689348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data availability and quality are major challenges in natural language
processing for low-resourced languages. In particular, there is significantly
less data available than for higher-resourced languages. This data is also
often of low quality, rife with errors, invalid text, or incorrect annotations.
Many prior works focus on dealing with these problems, either by generating
synthetic data, or filtering out low-quality parts of datasets. We instead
investigate these factors more deeply, by systematically measuring the effect
of data quantity and quality on the performance of pre-trained language models
in a low-resourced setting. Our results show that having fewer
completely-labelled sentences is significantly better than having more
sentences with missing labels; and that models can perform remarkably well with
only 10% of the training data. Importantly, these results are consistent across
ten low-resource languages, English, and four pre-trained models.
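
The experimental protocol described in the abstract contrasts two corruption regimes: keeping fewer fully-labelled sentences (quantity) versus keeping all sentences but removing some entity labels (quality). Below is a minimal Python sketch of how such corruptions could be simulated on CoNLL-style BIO-tagged data; it is an illustration under assumed conventions, not the authors' code, and the function names, probabilities, and toy data are invented for the example.

```python
import random

def subsample_sentences(sentences, fraction, seed=0):
    """Quantity corruption: keep only a fraction of fully-labelled sentences."""
    rng = random.Random(seed)
    k = max(1, int(len(sentences) * fraction))
    return rng.sample(sentences, k)

def drop_entity_labels(sentences, drop_prob, seed=0):
    """Quality corruption: keep every sentence, but replace each entity tag
    with 'O' (a missing label) with probability drop_prob."""
    rng = random.Random(seed)
    corrupted = []
    for sent in sentences:
        corrupted.append([
            (tok, "O" if tag != "O" and rng.random() < drop_prob else tag)
            for tok, tag in sent
        ])
    return corrupted

# Example: 10% of fully-labelled data vs. all data with half the labels removed.
data = [[("Thabo", "B-PER"), ("visited", "O"), ("Accra", "B-LOC")]] * 100
small_clean = subsample_sentences(data, fraction=0.10)
large_noisy = drop_entity_labels(data, drop_prob=0.50)
```

Training one model on `small_clean` and another on `large_noisy`, then comparing their scores, mirrors the quantity-versus-quality comparison the abstract reports.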
Related papers
- ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams [16.172599163455693]
First, we leverage high-quality data from linguistically or geographically related languages to improve TTS for the target language.
Second, we utilize low-quality Automatic Speech Recognition (ASR) data recorded in non-studio environments, which is refined using denoising and speech enhancement models.
Third, we apply knowledge distillation from large-scale models using synthetic data to generate more robust outputs.
arXiv Detail & Related papers (2024-10-23T14:18:25Z)
- Quantity vs. Quality of Monolingual Source Data in Automatic Text Translation: Can It Be Too Little If It Is Too Good? [2.492943108520374]
This study investigates whether monolingual data can also be too little, and whether reducing it on the basis of quality affects the performance of the translation model.
Experiments on English-German low-resource NMT show that it is often better to select only the most useful additional data, based on quality or relevance to the domain of the test data, than to use all of the available data.
arXiv Detail & Related papers (2024-10-17T17:20:40Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining on multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages, comparing the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models [4.168157981135698]
We show how knowledge can be distilled from Large Language Models (LLMs) to improve upon learned metrics without requiring human annotators.
We show that the performance of a BLEURT-like model on lower-resource languages can be improved in this way.
arXiv Detail & Related papers (2023-02-07T14:35:35Z)
- Detecting Label Errors using Pre-Trained Language Models [37.82128817976385]
We show that large pre-trained language models are extremely capable of identifying label errors in datasets (see the loss-ranking sketch after this list).
We contribute a novel method to produce highly realistic, human-originated label noise from crowdsourced data, and demonstrate the effectiveness of this method on TweetNLP.
arXiv Detail & Related papers (2022-05-25T11:59:39Z)
- ZeroBERTo -- Leveraging Zero-Shot Text Classification by Topic Modeling [57.80052276304937]
This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task.
We show that ZeroBERTo has better performance for long inputs and shorter execution time, outperforming XLM-R by about 12% in F1 score on the FolhaUOL dataset.
arXiv Detail & Related papers (2022-01-04T20:08:17Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for a reduction in the number of required experiments.
The difference in linguistic complexity across datasets also allows us to discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model with character language models trained on varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
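
The "Detecting Label Errors using Pre-Trained Language Models" entry above rests on a simple mechanism: training examples whose given label a strong pre-trained model assigns low probability are the most likely to be mislabelled. The following is a hedged loss-ranking sketch of that idea; `predict_proba` stands in for any fine-tuned model's output and is an assumption made for illustration, not the paper's API.

```python
import math

def flag_suspect_labels(examples, predict_proba, top_fraction=0.05):
    """Rank examples by the cross-entropy a pre-trained model assigns
    to their given label; the highest-loss examples are the most
    likely to be mislabelled.

    examples:      list of (text, label) pairs
    predict_proba: callable mapping text -> {label: probability},
                   e.g. from a fine-tuned language model (assumed here)
    """
    scored = []
    for text, label in examples:
        probs = predict_proba(text)
        loss = -math.log(max(probs.get(label, 0.0), 1e-12))
        scored.append((loss, text, label))
    scored.sort(reverse=True)  # highest loss first
    k = max(1, int(len(scored) * top_fraction))
    return scored[:k]

# Toy usage with a stand-in model (a real setup would use a fine-tuned LM):
toy = [("great film", "positive"), ("terrible film", "positive")]
probs = lambda text: {"positive": 0.9 if "great" in text else 0.1,
                      "negative": 0.1 if "great" in text else 0.9}
print(flag_suspect_labels(toy, probs, top_fraction=0.5))  # flags the second pair
```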
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.