A synthetic data approach for domain generalization of NLI models
- URL: http://arxiv.org/abs/2402.12368v2
- Date: Fri, 28 Jun 2024 10:36:27 GMT
- Title: A synthetic data approach for domain generalization of NLI models
- Authors: Mohammad Javad Hosseini, Andrey Petrov, Alex Fabrikant, Annie Louis,
- Abstract summary: Natural Language Inference (NLI) remains an important benchmark task for LLMs.
We show that synthetic high-quality datasets can adapt NLI models for zero-shot use in downstream applications.
We show that models trained on this data have the best generalization to completely new downstream test settings.
- Score: 13.840374911669167
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural Language Inference (NLI) remains an important benchmark task for LLMs. NLI datasets are a springboard for transfer learning to other semantic tasks, and NLI models are standard tools for identifying the faithfulness of model-generated text. There are several large scale NLI datasets today, and models have improved greatly by hill-climbing on these collections. Yet their realistic performance on out-of-distribution/domain data is less well-understood. We explore the opportunity for synthetic high-quality datasets to adapt NLI models for zero-shot use in downstream applications across new and unseen text domains. We demonstrate a new approach for generating NLI data in diverse domains and lengths, so far not covered by existing training sets. The resulting examples have meaningful premises, the hypotheses are formed in creative ways rather than simple edits to a few premise tokens, and the labels have high accuracy. We show that models trained on this data ($685$K synthetic examples) have the best generalization to completely new downstream test settings. On the TRUE benchmark, a T5-small model trained with our data improves around $7\%$ on average compared to training on the best alternative dataset. The improvements are more pronounced for smaller models, while still meaningful on a T5 XXL model. We also demonstrate gains on test sets when in-domain training data is augmented with our domain-general synthetic data.
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation [8.013158752919722]
Recent advances in large language model (LLM) training have highlighted the need for diverse, high-quality instruction data.
We propose a paradigm shift named textbfNOMAD by investigating how to specifically train models for data generation.
arXiv Detail & Related papers (2024-10-27T07:38:39Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z) - Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z) - Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of
Large Language Models [15.991777903345575]
Large language models can generalize to novel downstream tasks with relatively few labeled examples.
Alternatively, smaller models can solve specific tasks if fine-tuned with enough labeled examples.
We study synthetic data generation of fine-tuning training data via fine-tuned teacher LLMs to improve the downstream performance of much smaller models.
arXiv Detail & Related papers (2023-10-02T11:49:05Z) - ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback [21.168991554983815]
We propose a progressive zero-shot dataset generation framework, ProGen, to guide the generation of new training data.
We show ProGen achieves on-par or superior performance with only 1% synthetic dataset size.
arXiv Detail & Related papers (2022-10-22T02:07:10Z) - Falsesum: Generating Document-level NLI Examples for Recognizing Factual
Inconsistency in Summarization [63.21819285337555]
We show that NLI models can be effective for this task when the training data is augmented with high-quality task-oriented examples.
We introduce Falsesum, a data generation pipeline leveraging a controllable text generation model to perturb human-annotated summaries.
We show that models trained on a Falsesum-augmented NLI dataset improve the state-of-the-art performance across four benchmarks for detecting factual inconsistency in summarization.
arXiv Detail & Related papers (2022-05-12T10:43:42Z) - Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z) - A Comparison of LSTM and BERT for Small Corpus [0.0]
Recent advancements in the NLP field showed that transfer learning helps with achieving state-of-the-art results for new tasks by tuning pre-trained models instead of starting from scratch.
In this paper we focus on a real-life scenario that scientists in academia and industry face frequently: given a small dataset, can we use a large pre-trained model like BERT and get better results than simple models?
Our experimental results show that bidirectional LSTM models can achieve significantly higher results than a BERT model for a small dataset and these simple models get trained in much less time than tuning the pre-trained counterparts.
arXiv Detail & Related papers (2020-09-11T14:01:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.