On the Effectiveness of Dataset Embeddings in Mono-lingual, Multi-lingual
and Zero-shot Conditions
- URL: http://arxiv.org/abs/2103.01273v1
- Date: Mon, 1 Mar 2021 19:34:32 GMT
- Title: On the Effectiveness of Dataset Embeddings in Mono-lingual, Multi-lingual
and Zero-shot Conditions
- Authors: Rob van der Goot, Ahmet Üstün, Barbara Plank
- Abstract summary: We compare the effect of dataset embeddings in mono-lingual settings, multi-lingual settings, and with predicted data source label in a zero-shot setting.
We evaluate on three morphosyntactic tasks: morphological tagging, lemmatization, and dependency parsing, and use 104 datasets, 66 languages, and two different dataset grouping strategies.
- Score: 18.755176247223616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent complementary strands of research have shown that leveraging
information on the data source through encoding their properties into
embeddings can lead to performance increase when training a single model on
heterogeneous data sources. However, it remains unclear in which situations
these dataset embeddings are most effective, because they are used in a large
variety of settings, languages and tasks. Furthermore, it is usually assumed
that gold information on the data source is available, and that the test data
is from a distribution seen during training. In this work, we compare the
effect of dataset embeddings in mono-lingual settings, multi-lingual settings,
and with predicted data source label in a zero-shot setting. We evaluate on
three morphosyntactic tasks: morphological tagging, lemmatization, and
dependency parsing, and use 104 datasets, 66 languages, and two different
dataset grouping strategies. Performance increases are highest when the
datasets are of the same language, and we know from which distribution the
test-instance is drawn. In contrast, for setups where the data is from an
unseen distribution, performance increase vanishes.
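As a rough illustration of the idea in the abstract, one simple way to give a model information about the data source is to look up a vector per dataset and concatenate it to every token's word vector. The sketch below is an assumption made for illustration, not the authors' implementation: the table sizes, the random initialisation, and the names `word_table`, `dataset_table`, and `embed` are all hypothetical, and in a real model both tables would be learned during training.

```python
import random

random.seed(0)

# Hypothetical sizes, chosen only for illustration.
VOCAB_SIZE, NUM_DATASETS = 100, 3
WORD_DIM, DATASET_DIM = 8, 4


def random_vector(dim):
    """Stand-in for a learned embedding vector."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


# One lookup table for words, one for data sources (datasets).
word_table = [random_vector(WORD_DIM) for _ in range(VOCAB_SIZE)]
dataset_table = [random_vector(DATASET_DIM) for _ in range(NUM_DATASETS)]


def embed(token_ids, dataset_id):
    """Concatenate each token's word vector with its dataset's vector."""
    source = dataset_table[dataset_id]
    return [word_table[token_id] + source for token_id in token_ids]


# The same sentence, labelled as coming from two different datasets.
sentence = [5, 17, 42]
from_dataset_0 = embed(sentence, dataset_id=0)
from_dataset_2 = embed(sentence, dataset_id=2)
```

Here the two outputs share their first `WORD_DIM` values per token and differ only in the trailing `DATASET_DIM` values, which is the signal a downstream tagger or parser could condition on. In the zero-shot setting described above, the gold `dataset_id` is not available and would have to be predicted.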
Related papers
- A deep Natural Language Inference predictor without language-specific
training data [44.26507854087991]
We present an NLP technique for predicting the inference relation (NLI) between pairs of sentences in a target language of choice without a language-specific training dataset.
We exploit a generic, manually translated translation dataset, along with two instances of the same pre-trained model.
The model has been evaluated on the machine-translated Stanford NLI test dataset, the machine-translated Multi-Genre NLI test dataset, and the manually translated RTE3-ITA test dataset.
arXiv Detail & Related papers (2023-09-06T10:20:59Z)
- Large Language Model as Attributed Training Data Generator: A Tale of
Diversity and Bias [92.41919689753051]
Large language models (LLMs) have been recently leveraged as training data generators for various natural language processing (NLP) tasks.
We investigate training data generation with diversely attributed prompts, which have the potential to yield diverse and attributed generated data.
We show that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance.
arXiv Detail & Related papers (2023-06-28T03:31:31Z)
- Diversify Your Vision Datasets with Automatic Diffusion-Based
Augmentation [66.6546668043249]
ALIA (Automated Language-guided Image Augmentation) is a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains.
To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information.
We show that ALIA is able to surpass traditional data augmentation and text-to-image generated data on fine-grained classification tasks.
arXiv Detail & Related papers (2023-05-25T17:43:05Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation
on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- Generating Data to Mitigate Spurious Correlations in Natural Language
Inference Datasets [27.562256973255728]
Natural language processing models often exploit spurious correlations between task-independent features and labels in datasets to perform well only within the distributions they are trained on.
We propose to tackle this problem by generating a debiased version of a dataset, which can then be used to train a debiased, off-the-shelf model.
Our approach consists of 1) a method for training data generators to generate high-quality, label-consistent data samples; and 2) a filtering mechanism for removing data points that contribute to spurious correlations.
arXiv Detail & Related papers (2022-03-24T09:08:05Z)
- Parsing with Pretrained Language Models, Multiple Datasets, and Dataset
Embeddings [13.097523786733872]
We compare two methods to embed datasets in a transformer-based multilingual dependency parser.
We confirm that performance increases are highest for small datasets and datasets with a low baseline score.
We show that training on the combination of all datasets performs similarly to designing smaller clusters based on language-relatedness.
arXiv Detail & Related papers (2021-12-07T10:47:07Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Exploring Monolingual Data for Neural Machine Translation with Knowledge
Distillation [10.745228927771915]
We explore two types of monolingual data that can be included in knowledge distillation training for neural machine translation (NMT).
We find that source-side monolingual data improves model performance when evaluated on test sets originating from the source side.
We also show that it is not required to train the student model with the same data used by the teacher, as long as the domains are the same.
arXiv Detail & Related papers (2020-12-31T05:28:42Z)
- Multilingual Irony Detection with Dependency Syntax and Neural Models [61.32653485523036]
It focuses on the contribution from syntactic knowledge, exploiting linguistic resources where syntax is annotated according to the Universal Dependencies scheme.
The results suggest that fine-grained dependency-based syntactic information is informative for the detection of irony.
arXiv Detail & Related papers (2020-11-11T11:22:05Z)