Data-driven Model Generalizability in Crosslinguistic Low-resource
Morphological Segmentation
- URL: http://arxiv.org/abs/2201.01845v1
- Date: Wed, 5 Jan 2022 22:19:10 GMT
- Title: Data-driven Model Generalizability in Crosslinguistic Low-resource
Morphological Segmentation
- Authors: Zoey Liu, Emily Prud'hommeaux
- Abstract summary: In low-resource scenarios, artifacts of the data collection can yield data sets that are outliers, potentially making conclusions about model performance coincidental.
We compare three broad classes of models with different parameterizations, taking data from 11 languages across 6 language families.
The results demonstrate that the extent of model generalization depends on the characteristics of the data set, and does not necessarily rely heavily on the data set size.
- Score: 4.339613097080119
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Common designs of model evaluation typically focus on monolingual settings,
where different models are compared according to their performance on a single
data set that is assumed to be representative of all possible data for the task
at hand. While this may be reasonable for a large data set, this assumption is
difficult to maintain in low-resource scenarios, where artifacts of the data
collection can yield data sets that are outliers, potentially making
conclusions about model performance coincidental. To address these concerns, we
investigate model generalizability in crosslinguistic low-resource scenarios.
Using morphological segmentation as the test case, we compare three broad
classes of models with different parameterizations, taking data from 11
languages across 6 language families. In each experimental setting, we evaluate
all models on a first data set, then examine their performance consistency when
introducing new randomly sampled data sets with the same size and when applying
the trained models to unseen test sets of varying sizes. The results
demonstrate that the extent of model generalization depends on the
characteristics of the data set, and does not necessarily rely heavily on the
data set size. Among the characteristics that we studied, the ratio of morpheme
overlap and that of the average number of morphemes per word between the
training and test sets are the two most prominent factors. Our findings suggest
that future work should adopt random sampling to construct data sets with
different sizes in order to make more responsible claims about model
evaluation.
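To make the two data set characteristics and the random-sampling recommendation above more concrete, here is a minimal Python sketch. It assumes a simple representation in which each example is a word paired with its gold morpheme sequence; the function names and the exact formulation of the ratios are illustrative assumptions, not the authors' implementation.

```python
import random

# Assumed, illustrative data format (not the paper's code): each example is a
# (word, morpheme_list) pair, e.g. ("unhappiness", ["un", "happi", "ness"]).

def morpheme_overlap_ratio(train, test):
    """Fraction of morpheme types in the test set that also occur in training."""
    train_types = {m for _, morphs in train for m in morphs}
    test_types = {m for _, morphs in test for m in morphs}
    return len(test_types & train_types) / len(test_types) if test_types else 0.0

def avg_morphemes_per_word(data):
    """Average number of morphemes per word in a data set."""
    return sum(len(morphs) for _, morphs in data) / len(data)

def morphemes_per_word_ratio(train, test):
    """Ratio of the average number of morphemes per word, training over test."""
    return avg_morphemes_per_word(train) / avg_morphemes_per_word(test)

def random_splits(pairs, train_size, test_size, n_samples, seed=0):
    """Yield repeated randomly sampled train/test splits of fixed sizes."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        shuffled = list(pairs)
        rng.shuffle(shuffled)
        yield shuffled[:train_size], shuffled[train_size:train_size + test_size]
```

For each sampled split, one would train a segmentation model, score it on the held-out portion, and record the two ratios alongside that score, making it possible to relate performance consistency to data set characteristics rather than to data set size alone.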
Related papers
- Target-Aware Language Modeling via Granular Data Sampling [25.957424920194914]
Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources.
A cost-effective and straightforward approach is sampling with low-dimensional data features.
We show that models pretrained on the sampled data perform on par with those trained on the full RefinedWeb data and outperform models trained on randomly selected samples, for model sizes ranging from 125M to 1.5B.
arXiv Detail & Related papers (2024-09-23T04:52:17Z)
- The Effect of Data Partitioning Strategy on Model Generalizability: A Case Study of Morphological Segmentation [6.979385830035607]
We use data from 19 languages, including ten indigenous or endangered languages across 10 language families with diverse morphological systems.
We conduct large-scale experiments with combinations of training and evaluation sets of varying sizes, along with new test data.
Our results show that, when faced with new test data, models trained from random splits are able to achieve higher numerical scores.
arXiv Detail & Related papers (2024-04-14T22:22:58Z)
- A Case for Dataset Specific Profiling [0.9023847175654603]
Data-driven science is an emerging paradigm where scientific discoveries depend on the execution of computational AI models against rich, discipline-specific datasets.
With modern machine learning frameworks, anyone can develop and execute computational models that reveal concepts hidden in the data that could enable scientific applications.
For important and widely used datasets, computing the performance of every computational model that can run against a dataset is cost prohibitive in terms of cloud resources.
arXiv Detail & Related papers (2022-08-01T18:38:05Z)
- Identifying the Context Shift between Test Benchmarks and Production Data [1.2259552039796024]
There exists a performance gap between machine learning models' accuracy on dataset benchmarks and real-world production data.
We outline two methods for identifying changes in context that lead to distribution shifts and model prediction errors.
We present two case-studies to highlight the implicit assumptions underlying applied machine learning models that tend to lead to errors.
arXiv Detail & Related papers (2022-07-03T14:54:54Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
- Characterizing Fairness Over the Set of Good Models Under Selective Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.