Parsing with Pretrained Language Models, Multiple Datasets, and Dataset
Embeddings
- URL: http://arxiv.org/abs/2112.03625v1
- Date: Tue, 7 Dec 2021 10:47:07 GMT
- Title: Parsing with Pretrained Language Models, Multiple Datasets, and Dataset
Embeddings
- Authors: Rob van der Goot and Miryam de Lhoneux
- Abstract summary: We compare two methods to embed datasets in a transformer-based multilingual dependency parser.
We confirm that performance increases are highest for small datasets and datasets with a low baseline score.
We show that training on the combination of all datasets performs similarly to designing smaller clusters based on language-relatedness.
- Score: 13.097523786733872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With an increase of dataset availability, the potential for learning from a
variety of data sources has increased. One particular method to improve
learning from multiple data sources is to embed the data source during
training. This allows the model to learn generalizable features as well as
distinguishing features between datasets. However, these dataset embeddings
have mostly been used before contextualized transformer-based embeddings were
introduced in the field of Natural Language Processing. In this work, we
compare two methods to embed datasets in a transformer-based multilingual
dependency parser, and perform an extensive evaluation. We show that: 1)
embedding the dataset is still beneficial with these models; 2) performance
increases are highest when embedding the dataset at the encoder level; 3)
unsurprisingly, we confirm that performance increases are highest for small
datasets and datasets with a low baseline score; and 4) training on the
combination of all datasets performs similarly to designing smaller clusters
based on language-relatedness.
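The core idea of dataset embeddings at the encoder level can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a learned vector per dataset that is summed into every token embedding before the encoder, so the model can condition on the data source (all names and sizes here are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, N_DATASETS, DIM = 100, 3, 8

# Hypothetical parameters: ordinary token embeddings plus one learned
# vector per dataset (the real model uses a transformer encoder).
token_emb = rng.normal(size=(VOCAB, DIM))
dataset_emb = rng.normal(size=(N_DATASETS, DIM))

def encode_input(token_ids, dataset_id):
    """Sum the dataset embedding into every token embedding so the
    downstream encoder can condition on the data source."""
    tokens = token_emb[token_ids]             # (seq_len, DIM)
    return tokens + dataset_emb[dataset_id]   # broadcast over positions

sent = [5, 17, 42]
# The same sentence yields different encoder inputs per source dataset.
enc_a = encode_input(sent, dataset_id=0)
enc_b = encode_input(sent, dataset_id=1)
assert enc_a.shape == (3, DIM)
assert not np.allclose(enc_a, enc_b)
```

At inference time the dataset ID of the input (or a predicted one, in zero-shot settings) selects which embedding is added.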
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
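Combining heterogeneous feedback into a single supervision format can be sketched as below. This is a hedged illustration only, with hypothetical formats and helper names: two feedback shapes (numeric ratings and pairwise preferences) are normalized into one (prompt, chosen, rejected) record, the shape commonly consumed by preference-tuning pipelines.

```python
# Sketch: normalize two hypothetical feedback formats into one
# (prompt, chosen, rejected) preference record.

def from_ratings(prompt, responses):
    """responses: list of (text, score); keep best vs. worst response."""
    ranked = sorted(responses, key=lambda r: r[1], reverse=True)
    return {"prompt": prompt,
            "chosen": ranked[0][0],
            "rejected": ranked[-1][0]}

def from_pairwise(prompt, winner, loser):
    """Pairwise human preference already matches the target format."""
    return {"prompt": prompt, "chosen": winner, "rejected": loser}

# One unified dataset from two different feedback sources.
unified = [
    from_ratings("Q1", [("answer a", 0.2), ("answer b", 0.9)]),
    from_pairwise("Q2", winner="answer x", loser="answer y"),
]
assert unified[0]["chosen"] == "answer b"
```

Once every source is in the same shape, a quality filter can be applied uniformly to select the subset used for fine-tuning.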
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation.
On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%.
We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z)
- Combining datasets to increase the number of samples and improve model fitting [7.4771091238795595]
We propose a novel framework called Combine datasets based on Imputation (ComImp).
In addition, we propose a variant of ComImp that uses Principal Component Analysis (PCA), PCA-ComImp, to reduce dimensionality before combining datasets.
Our results indicate that the proposed methods are somewhat similar to transfer learning in that the merge can significantly improve the accuracy of a prediction model on smaller datasets.
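The impute-then-combine idea can be sketched with numpy. This is a loose illustration under stated assumptions, not the ComImp algorithm itself: missing features are padded with NaN, filled by a simple per-column mean imputer (a stand-in for whatever imputer ComImp uses), and the stacked matrix is optionally reduced with PCA via SVD, mirroring the PCA-ComImp variant.

```python
import numpy as np

def impute_mean(X):
    """Fill NaNs with per-column means (stand-in for ComImp's imputer)."""
    X = X.copy()
    means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = means[cols]
    return X

def pca_reduce(X, k):
    """Project onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

# Two hypothetical datasets with overlapping but distinct feature sets,
# padded with NaN where a feature is absent, imputed, then stacked.
A = np.array([[1.0, 2.0, np.nan], [2.0, np.nan, 3.0]])
B = np.array([[np.nan, 1.0, 4.0], [3.0, 2.0, np.nan]])
combined = np.vstack([impute_mean(A), impute_mean(B)])
reduced = pca_reduce(combined, k=2)   # PCA-ComImp-style dimension cut
assert combined.shape == (4, 3) and not np.isnan(combined).any()
assert reduced.shape == (4, 2)
```

The payoff, as the abstract notes, is that a model fit on the larger merged matrix can outperform one fit on either small dataset alone.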
arXiv Detail & Related papers (2022-10-11T06:06:37Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
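Replacing one-hot labels with word embeddings makes cross-dataset alignment a nearest-neighbor lookup in the shared space. The sketch below is illustrative only, with toy hand-written embeddings rather than Detection Hub's learned ones: a dataset-specific label ("pedestrian") maps to its closest unified category by cosine similarity.

```python
import numpy as np

# Toy word embeddings (hypothetical); in the unified space, labels from
# different datasets are compared as vectors rather than one-hot IDs.
emb = {
    "person":     np.array([0.90, 0.10, 0.00]),
    "pedestrian": np.array([0.85, 0.15, 0.05]),
    "car":        np.array([0.00, 0.90, 0.30]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def align(category, unified_space):
    """Map a dataset-specific label to its nearest unified category."""
    return max(unified_space, key=lambda u: cos(emb[category], emb[u]))

# "pedestrian" from one detection dataset aligns to the unified "person".
assert align("pedestrian", ["person", "car"]) == "person"
```

Because semantically equivalent labels land near each other, the detector receives one coherent target space instead of conflicting per-dataset label sets.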
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation [63.24594955429465]
Multi-source entity linkage is critical in high-impact applications such as data cleaning and user stitching.
AdaMEL is a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage.
Our framework achieves state-of-the-art results with 8.21% improvement on average over methods based on supervised learning.
arXiv Detail & Related papers (2021-10-27T15:20:41Z)
- Single-dataset Experts for Multi-dataset Question Answering [6.092171111087768]
We train a network on multiple datasets to generalize and transfer better to new datasets.
Our approach is to model multi-dataset question answering with a collection of single-dataset experts.
Simple methods based on parameter-averaging lead to better zero-shot generalization and few-shot transfer performance.
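Parameter-averaging of single-dataset experts can be sketched in a few lines. This assumes experts share one architecture so parameters align by name; the toy dicts below stand in for real model state, and the merge is a plain element-wise mean.

```python
import numpy as np

# Hypothetical per-dataset "expert" parameter dicts (same architecture,
# trained separately on different QA datasets).
expert_a = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
expert_b = {"w": np.array([3.0, 4.0]), "b": np.array([2.0])}

def average_experts(experts):
    """Element-wise mean of matching parameters across all experts."""
    return {k: np.mean([e[k] for e in experts], axis=0)
            for k in experts[0]}

merged = average_experts([expert_a, expert_b])
assert np.allclose(merged["w"], [2.0, 3.0])
assert np.allclose(merged["b"], [1.0])
```

The merged model requires no further training, which is what makes the approach attractive for zero-shot transfer to unseen datasets.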
arXiv Detail & Related papers (2021-09-28T17:08:22Z)
- On the Effectiveness of Dataset Embeddings in Mono-lingual, Multi-lingual and Zero-shot Conditions [18.755176247223616]
We compare the effect of dataset embeddings in mono-lingual settings, multi-lingual settings, and with predicted data source label in a zero-shot setting.
We evaluate on three morphosyntactic tasks: morphological tagging, lemmatization, and dependency parsing, and use 104 datasets, 66 languages, and two different dataset grouping strategies.
arXiv Detail & Related papers (2021-03-01T19:34:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.