Eeny, meeny, miny, moe. How to choose data for morphological inflection
- URL: http://arxiv.org/abs/2210.14465v1
- Date: Wed, 26 Oct 2022 04:33:18 GMT
- Title: Eeny, meeny, miny, moe. How to choose data for morphological inflection
- Authors: Saliha Muradoglu and Mans Hulden
- Abstract summary: This paper explores four sampling strategies for the task of morphological inflection using a Transformer model.
We investigate the robustness of each strategy across 30 typologically diverse languages.
Our results show a clear benefit to selecting data based on model confidence and entropy.
- Score: 8.914777617216862
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data scarcity is a widespread problem in numerous natural language processing
(NLP) tasks for low-resource languages. Within morphology, the labour-intensive
work of tagging/glossing data is a serious bottleneck for both NLP and language
documentation. Active learning (AL) aims to reduce the cost of data annotation
by selecting data that is most informative for improving the model. In this
paper, we explore four sampling strategies for the task of morphological
inflection using a Transformer model: a pair of oracle experiments where data
is chosen based on whether the model already can or cannot inflect the test
forms correctly, as well as strategies based on high/low model confidence,
high/low entropy, and random selection. We investigate the robustness of each
strategy across 30 typologically diverse languages. We also perform a more
in-depth case study of Natügu. Our results show a clear benefit to selecting
data based on model confidence and entropy. Unsurprisingly, the oracle
experiment in which only incorrectly handled forms are chosen for further
training (presented as a proxy for linguist/language consultant feedback)
shows the most improvement. This is followed closely by choosing
low-confidence and high-entropy predictions. We also show that despite the
conventional wisdom of larger data sets yielding better accuracy, introducing
more instances of high-confidence or low-entropy forms, or forms that the model
can already inflect correctly, can reduce model performance.
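The confidence- and entropy-based selection described in the abstract can be sketched in a few lines. The code below is an illustrative sketch, not the paper's implementation: the item fields (`form`, `token_probs`, `dist`) and function names are assumptions standing in for a Transformer's per-token output probabilities and its predictive distribution.

```python
import math

def sequence_confidence(token_probs):
    """Mean log-probability of the model's best hypothesis (higher = more confident)."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def distribution_entropy(dist):
    """Shannon entropy of a predicted distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def select_for_annotation(pool, k, strategy="low_confidence"):
    """Pick the k pool items the model is least sure about.

    pool: list of dicts with hypothetical fields:
      'form'        - the lemma+tag input to be inflected
      'token_probs' - per-token probabilities of the model's best hypothesis
      'dist'        - the model's output distribution at one decoding step
    """
    if strategy == "low_confidence":
        ranked = sorted(pool, key=lambda x: sequence_confidence(x["token_probs"]))
    elif strategy == "high_entropy":
        ranked = sorted(pool, key=lambda x: distribution_entropy(x["dist"]),
                        reverse=True)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [x["form"] for x in ranked[:k]]
```

Selecting low-confidence or high-entropy items adds the forms the model is most likely to get wrong, which is why these strategies track the "incorrect-forms" oracle; the inverse strategies keep feeding the model what it already knows, matching the reported performance drop.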
Related papers
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Information FOMO: The unhealthy fear of missing out on information. A method for removing misleading data for healthier models [0.0]
Misleading or unnecessary data can have out-sized impacts on the health or accuracy of Machine Learning (ML) models.
We present a sequential selection method that identifies critically important information within a dataset.
We find these instabilities are a result of the complexity of the underlying map and linked to extreme events and heavy tails.
arXiv Detail & Related papers (2022-08-27T19:43:53Z) - An Empirical Investigation of Commonsense Self-Supervision with
Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z) - Super-Prompting: Utilizing Model-Independent Contextual Data to Reduce
Data Annotation Required in Visual Commonsense Tasks [3.42658286826597]
We analyze different prompt-based fine-tuning techniques to improve results on both language and multimodal causal transformer models.
Our results show that by simple model-agnostic prompt-based fine-tuning, comparable results can be reached by only using 35%-40% of the fine-tuning training dataset.
arXiv Detail & Related papers (2022-04-25T18:56:55Z) - Uncertainty Estimation for Language Reward Models [5.33024001730262]
Language models can learn a range of capabilities from unsupervised training on text corpora.
It is often easier for humans to choose between options than to provide labeled data, and prior work has achieved state-of-the-art performance by training a reward model from such preference comparisons.
We seek to address these problems via uncertainty estimation, which can improve sample efficiency and robustness using active learning and risk-averse reinforcement learning.
arXiv Detail & Related papers (2022-03-14T20:13:21Z) - How Does Data Corruption Affect Natural Language Understanding Models? A
Study on GLUE datasets [4.645287693363387]
We show that performance remains high for most GLUE tasks when the models are fine-tuned or tested on corrupted data.
Our proposed data transformations can be used as a diagnostic tool for assessing the extent to which a specific dataset constitutes a proper testbed for evaluating models' language understanding capabilities.
arXiv Detail & Related papers (2022-01-12T13:35:53Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Understanding and Improving Lexical Choice in Non-Autoregressive
Translation [98.11249019844281]
We propose to expose the raw data to NAT models to restore the useful information of low-frequency words.
Our approach pushes the SOTA NAT performance on the WMT14 English-German and WMT16 Romanian-English datasets up to 27.8 and 33.8 BLEU points, respectively.
arXiv Detail & Related papers (2020-12-29T03:18:50Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Parameter Space Factorization for Zero-Shot Learning across Tasks and
Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.