SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets
- URL: http://arxiv.org/abs/2111.06467v1
- Date: Thu, 11 Nov 2021 21:21:48 GMT
- Title: SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets
- Authors: Ann Yuan, Daphne Ippolito, Vitaly Nikolaev, Chris Callison-Burch, Andy Coenen, Sebastian Gehrmann
- Abstract summary: We introduce a novel method for efficient dataset curation.
We use a large language model to provide seed generations to human raters.
We show that our dataset of fictional biographies is less noisy than WikiBio.
- Score: 26.75449546181059
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: NLP researchers need more, higher-quality text datasets. Human-labeled
datasets are expensive to collect, while datasets collected via automatic
retrieval from the web such as WikiBio are noisy and can include undesired
biases. Moreover, data sourced from the web is often included in datasets used
to pretrain models, leading to inadvertent cross-contamination of training and
test sets. In this work we introduce a novel method for efficient dataset
curation: we use a large language model to provide seed generations to human
raters, thereby changing dataset authoring from a writing task to an editing
task. We use our method to curate SynthBio - a new evaluation set for WikiBio -
composed of structured attribute lists describing fictional individuals, mapped
to natural language biographies. We show that our dataset of fictional
biographies is less noisy than WikiBio, and also more balanced with respect to
gender and nationality.
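The record layout described in the abstract (a structured attribute list mapped to a natural language biography) and the kind of balance check it mentions can be illustrated with a minimal sketch. The class `BioEntry`, the function `attribute_balance`, the attribute keys, and the sample individuals are all hypothetical; none of them is taken from the paper or its released data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BioEntry:
    """Hypothetical record: an attribute list paired with a biography."""
    attributes: Dict[str, str]  # structured attribute list for one individual
    biography: str              # natural language biography for that individual


def attribute_balance(entries: List[BioEntry], key: str) -> Dict[str, float]:
    """Return the fraction of entries holding each value of `key`."""
    counts = Counter(e.attributes.get(key, "unknown") for e in entries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


# Invented sample data, for illustration only.
entries = [
    BioEntry({"name": "Ada Example", "gender": "female", "nationality": "Kenyan"},
             "Ada Example was a Kenyan botanist."),
    BioEntry({"name": "Ben Sample", "gender": "male", "nationality": "Chilean"},
             "Ben Sample was a Chilean composer."),
    BioEntry({"name": "Cara Mock", "gender": "female", "nationality": "Chilean"},
             "Cara Mock was a Chilean architect."),
    BioEntry({"name": "Dan Placeholder", "gender": "male", "nationality": "Kenyan"},
             "Dan Placeholder was a Kenyan novelist."),
]

gender_share = attribute_balance(entries, "gender")
nationality_share = attribute_balance(entries, "nationality")
```

In this toy set each gender and each nationality accounts for half of the entries; applying the same function to another collection of records would show how skewed it is along the chosen attribute.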
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language [7.59001382786429]
This paper applies guided distant supervision to create a large biographical relationship extraction dataset for German.
Our dataset, composed of more than 80,000 instances for nine relationship types, is the largest biographical German relationship extraction dataset.
We train several state-of-the-art machine learning models on the automatically created dataset and release them as well.
arXiv Detail & Related papers (2024-03-25T19:40:26Z)
- Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- Integrating curation into scientific publishing to train AI models [1.6982459897303823]
We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions.
The dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities.
We evaluate the utility of the dataset to train AI models using named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task.
arXiv Detail & Related papers (2023-10-31T13:22:38Z)
- BioREx: Improving Biomedical Relation Extraction by Leveraging Heterogeneous Datasets [7.7587371896752595]
Biomedical relation extraction (RE) is a central task in biomedical natural language processing (NLP) research.
We present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset.
Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset.
arXiv Detail & Related papers (2023-06-19T22:48:18Z)
- Unsupervised Neural Stylistic Text Generation using Transfer learning and Adapters [66.17039929803933]
We propose a novel transfer learning framework which updates only 0.3% of model parameters to learn style-specific attributes for response generation.
We learn style-specific attributes from the PERSONALITY-CAPTIONS dataset.
arXiv Detail & Related papers (2022-10-07T00:09:22Z)
- BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing [13.30221348538759]
We introduce BigBIO, a community library of 126+ biomedical NLP datasets.
BigBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata.
We discuss our process for task schema, data auditing, and contribution guidelines, and outline two illustrative use cases.
arXiv Detail & Related papers (2022-06-30T07:15:45Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.