tasksource: A Dataset Harmonization Framework for Streamlined NLP
Multi-Task Learning and Evaluation
- URL: http://arxiv.org/abs/2301.05948v3
- Date: Tue, 16 May 2023 08:19:44 GMT
- Title: tasksource: A Dataset Harmonization Framework for Streamlined NLP
Multi-Task Learning and Evaluation
- Authors: Damien Sileo
- Abstract summary: We release a dataset annotation framework and dataset annotations for more than 500 English tasks.
These annotations include metadata, such as the names of columns to be used as input or labels for all datasets.
We fine-tune a multi-task text encoder on all tasksource tasks, outperforming every publicly available text encoder of comparable size in an external evaluation.
- Score: 2.869669835645836
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The HuggingFace Datasets Hub hosts thousands of datasets, offering exciting
opportunities for language model training and evaluation. However, datasets for
a specific task type often have different schemas, making harmonization
challenging. Multi-task training or evaluation necessitates manual work to fit
data into task templates. Several initiatives independently tackle this issue
by releasing harmonized datasets or providing harmonization code to preprocess
datasets into a consistent format. We identify patterns across previous
preprocessing efforts, such as column name mapping and extracting specific
sub-fields from structured data in a column. We then propose a structured
annotation framework that ensures our annotations are fully exposed and not
hidden within unstructured code. We release a dataset annotation framework and
dataset annotations for more than 500 English
tasks (https://github.com/sileod/tasksource). These annotations
include metadata, such as the names of columns to be used as input or labels
for all datasets, which can save time for future dataset preprocessing,
regardless of whether our framework is utilized. We fine-tune a multi-task text
encoder on all tasksource tasks, outperforming every publicly available text
encoder of comparable size in an external evaluation.
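The two preprocessing patterns named in the abstract, column name mapping and extraction of a sub-field from a structured column, can be illustrated with a small sketch. This is not the tasksource API: the `ClassificationAnnotation` class, its field names, the `harmonize` helper, and the toy rows are all hypothetical, written in plain Python only to show how a declarative annotation keeps the mapping visible instead of burying it in preprocessing code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Union

Row = Dict[str, Any]
# A source is either a column name or a function that extracts a value from a row.
Source = Union[str, Callable[[Row], Any]]


@dataclass(frozen=True)
class ClassificationAnnotation:
    """Hypothetical annotation: maps dataset-specific fields to shared template fields."""
    sentence1: Source
    labels: Source
    sentence2: Optional[Source] = None

    def apply(self, row: Row) -> Row:
        """Build one harmonized row from one source row."""
        out: Row = {}
        for target in ("sentence1", "sentence2", "labels"):
            source = getattr(self, target)
            if source is None:
                continue  # this template field is unused for the dataset
            # Callable sources extract a sub-field; string sources rename a column.
            out[target] = source(row) if callable(source) else row[source]
        return out


def harmonize(rows: List[Row], annotation: ClassificationAnnotation) -> List[Row]:
    """Apply one annotation to every row of a dataset."""
    return [annotation.apply(row) for row in rows]


# Toy rows standing in for two datasets with different schemas (made-up data).
nli_rows = [{"premise": "A man sleeps.", "hypothesis": "A person rests.", "label": 0}]
nested_rows = [{"text": "Great film!", "meta": {"sentiment": "positive"}}]

# Pattern 1: plain column name mapping.
nli = ClassificationAnnotation(
    sentence1="premise", sentence2="hypothesis", labels="label"
)
# Pattern 2: extracting a sub-field from structured data in a column.
sentiment = ClassificationAnnotation(
    sentence1="text", labels=lambda row: row["meta"]["sentiment"]
)

harmonized_nli = harmonize(nli_rows, nli)
harmonized_sentiment = harmonize(nested_rows, sentiment)
```

After harmonization, both toy datasets expose the same target field names (`sentence1`, optionally `sentence2`, and `labels`), so a single training or evaluation loop could consume them without dataset-specific branches.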
Related papers
- The Power of Summary-Source Alignments [62.76959473193149]
Multi-document summarization (MDS) is a challenging task, often decomposed to subtasks of salience and redundancy detection.
Alignment of corresponding sentences between a reference summary and its source documents has been leveraged to generate training data.
This paper proposes extending the summary-source alignment framework by applying it at the more fine-grained proposition span level.
arXiv Detail & Related papers (2024-06-02T19:35:19Z)
- Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation.
On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%.
We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z)
- Thinking Like an Annotator: Generation of Dataset Labeling Instructions [59.603239753484345]
We introduce a new task, Labeling Instruction Generation, to address missing publicly available labeling instructions.
We take a reasonably annotated dataset and: 1) generate a set of examples that are visually representative of each category in the dataset; 2) provide a text label that corresponds to each of the examples.
This framework acts as a proxy to human annotators that can help to both generate a final labeling instruction set and evaluate its quality.
arXiv Detail & Related papers (2023-06-24T18:32:48Z)
- DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model [42.49953563682122]
We propose to train a universal multi-dataset multi-task segmentation model: DaTaSeg.
We use a shared representation (mask proposals with class predictions) for all tasks.
We also leverage weak-supervision, allowing our segmentation model to benefit from cheaper bounding box annotations.
arXiv Detail & Related papers (2023-06-02T17:59:24Z)
- Cross-TOP: Zero-Shot Cross-Schema Task-Oriented Parsing [5.5947246682911205]
Cross-TOP is a zero-shot method for complex semantic parsing in a given vertical.
We show that Cross-TOP can achieve high accuracy on a previously unseen task without requiring any additional training data.
arXiv Detail & Related papers (2022-06-10T20:50:08Z)
- Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Low Resource Multi-Task Sequence Tagging -- Revisiting Dynamic Conditional Random Fields [67.51177964010967]
We compare different models for low resource multi-task sequence tagging that leverage dependencies between label sequences for different tasks.
We find that explicit modeling of inter-dependencies between task predictions outperforms single-task as well as standard multi-task models.
arXiv Detail & Related papers (2020-05-01T07:11:34Z)