High-Resource Methodological Bias in Low-Resource Investigations
- URL: http://arxiv.org/abs/2211.07534v1
- Date: Mon, 14 Nov 2022 17:04:38 GMT
- Title: High-Resource Methodological Bias in Low-Resource Investigations
- Authors: Maartje ter Hoeve, David Grangier, Natalie Schluter
- Abstract summary: We show that down sampling from a high-resource language results in datasets with different properties than the low-resource datasets.
We conclude that naive down sampling of datasets results in a biased view of how well these systems work in a low-resource scenario.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The central bottleneck for low-resource NLP is typically regarded as the
quantity of accessible data, overlooking the contribution of data quality. This
is particularly evident in the development and evaluation of low-resource systems
via down sampling of high-resource language data. In this work we investigate
the validity of this approach, and we specifically focus on two well-known NLP
tasks for our empirical investigations: POS-tagging and machine translation. We
show that down sampling from a high-resource language results in datasets with
different properties than the low-resource datasets, impacting the model
performance for both POS-tagging and machine translation. Based on these
results we conclude that naive down sampling of datasets results in a biased
view of how well these systems work in a low-resource scenario.
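The naive down-sampling setup the abstract critiques can be sketched as follows; this is a minimal illustration, not the paper's actual pipeline, and the corpus, sample size, and seed are invented for the example:

```python
import random

def downsample_corpus(sentences, target_size, seed=0):
    """Naively simulate a low-resource dataset by randomly sampling
    sentences from a high-resource corpus.

    Note: as the paper argues, the sampled subset inherits the
    high-resource corpus's properties (domain, vocabulary distribution,
    annotation quality) rather than those of a genuine low-resource
    dataset, which can bias evaluation.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(sentences, target_size)  # sampling without replacement

# Illustrative usage with a toy "high-resource" corpus.
high_resource = [f"sentence {i}" for i in range(100_000)]
simulated_low_resource = downsample_corpus(high_resource, target_size=1_000)
print(len(simulated_low_resource))  # 1000
```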
Related papers
- Order Matters in the Presence of Dataset Imbalance for Multilingual Learning [53.74649778447903]
We present a simple yet effective method of pre-training on high-resource tasks, followed by fine-tuning on a mixture of high/low-resource tasks.
We show its improvements in neural machine translation (NMT) and multi-lingual language modeling.
arXiv Detail & Related papers (2023-12-11T05:46:57Z)
- Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [51.33182775762785]
This paper presents an empirical study to build relation extraction systems in low-resource settings.
We investigate three schemes to evaluate the performance in low-resource settings: (i) different types of prompt-based methods with few-shot labeled data; (ii) diverse balancing methods to address the long-tailed distribution issue; and (iii) data augmentation technologies and self-training to generate more labeled in-domain data.
arXiv Detail & Related papers (2022-10-19T15:46:37Z)
- Efficient Methods for Natural Language Processing: A Survey [76.34572727185896]
This survey synthesizes and relates current methods and findings in efficient NLP.
We aim to provide both guidance for conducting NLP under limited resources, and point towards promising research directions for developing more efficient methods.
arXiv Detail & Related papers (2022-08-31T20:32:35Z)
- Data Augmentation for Low-Resource Named Entity Recognition Using Backtranslation [1.195496689595016]
We adapt backtranslation to generate high quality and linguistically diverse synthetic data for low-resource named entity recognition.
We perform experiments on two datasets from the materials science (MaSciP) and biomedical (S800) domains.
arXiv Detail & Related papers (2021-08-26T10:56:39Z)
- A Survey on Low-Resource Neural Machine Translation [106.51056217748388]
We classify related works into three categories according to the auxiliary data they used.
We hope that our survey can help researchers to better understand this field and inspire them to design better algorithms.
arXiv Detail & Related papers (2021-07-09T06:26:38Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Variational Information Bottleneck for Effective Low-Resource Fine-Tuning [40.66716433803935]
We propose to use Variational Information Bottleneck (VIB) to suppress irrelevant features when fine-tuning on low-resource target tasks.
We show that our VIB model finds sentence representations that are more robust to biases in natural language inference datasets.
arXiv Detail & Related papers (2021-06-10T03:08:13Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
- Establishing Baselines for Text Classification in Low-Resource Languages [0.0]
First, we introduce two previously unreleased datasets as benchmark datasets for text classification.
Second, we pretrain better BERT and DistilBERT models for use within the Filipino setting.
Third, we introduce a simple degradation test that benchmarks a model's resistance to performance degradation as the number of training samples is reduced.
arXiv Detail & Related papers (2020-05-05T11:17:07Z)
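The degradation test described in the last entry above can be sketched generically; this is a schematic, not the paper's implementation, and the subset fractions, the toy majority-class "model", and the train/eval function names are all invented for illustration:

```python
import random

def degradation_test(train_fn, eval_fn, train_data,
                     fractions=(1.0, 0.5, 0.25, 0.1), seed=0):
    """Retrain on progressively smaller random subsets of the training
    data and record the evaluation score for each fraction, exposing
    how quickly performance degrades as training data shrinks."""
    rng = random.Random(seed)
    scores = {}
    for frac in fractions:
        n = max(1, int(len(train_data) * frac))
        subset = rng.sample(train_data, n)  # random subset, no replacement
        model = train_fn(subset)
        scores[frac] = eval_fn(model)
    return scores

# Toy illustration: labeled pairs and a majority-class baseline.
data = [(f"text {i}", i % 3 == 0) for i in range(300)]

def train_majority(examples):
    # "Training" = pick the most frequent label in the subset.
    labels = [y for _, y in examples]
    return max(set(labels), key=labels.count)

def eval_on_full(model):
    # Accuracy of always predicting the majority label.
    return sum(1 for _, y in data if y == model) / len(data)

print(degradation_test(train_majority, eval_on_full, data))
```

In practice `train_fn` and `eval_fn` would wrap a real classifier and a held-out evaluation set; the fixed seed keeps the sampled subsets reproducible across runs.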
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.