Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
- URL: http://arxiv.org/abs/2409.08239v1
- Date: Thu, 12 Sep 2024 17:39:08 GMT
- Title: Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
- Authors: Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli
- Abstract summary: We propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations.
Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in real-world sources.
We demonstrate the generality of this approach by applying it to two challenging domains: we test reasoning abilities in multi-hop question answering (MHQA), and tool usage in tabular question answering (TQA).
- Score: 38.30192495271699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models still struggle in challenging scenarios that leverage structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. Source2Synth improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: we test reasoning abilities in multi-hop question answering (MHQA), and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to the fine-tuned baselines.
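To make the pipeline described in the abstract concrete, here is a minimal sketch in Python of a Source2Synth-style generate-then-curate loop. It is not the authors' implementation: the `llm` callable, the prompt formats, and the exact answerability check are illustrative placeholders, and the paper's actual curation stage is more involved than this simple filter.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Placeholder for any text-completion callable: prompt in, text out.
LLM = Callable[[str], str]

@dataclass
class SyntheticExample:
    source: str     # real-world passage/table the example is grounded in
    question: str
    reasoning: str  # intermediate reasoning steps
    answer: str

def _field(text: str, start: str, end: Optional[str]) -> str:
    """Pull the span between two markers; assumes the model followed the requested format."""
    chunk = text.split(start, 1)[1]
    return (chunk.split(end, 1)[0] if end else chunk).strip()

def generate_example(llm: LLM, source: str) -> SyntheticExample:
    """Ask the model for a question, reasoning steps, and answer grounded in `source`."""
    prompt = (
        "Using only the source below, write a question, the step-by-step reasoning "
        "needed to answer it, and the final answer.\n"
        f"Source:\n{source}\n"
        "Reply exactly as: QUESTION: ... REASONING: ... ANSWER: ..."
    )
    text = llm(prompt)
    return SyntheticExample(
        source=source,
        question=_field(text, "QUESTION:", "REASONING:"),
        reasoning=_field(text, "REASONING:", "ANSWER:"),
        answer=_field(text, "ANSWER:", None),
    )

def is_answerable(llm: LLM, ex: SyntheticExample) -> bool:
    """Curation step: keep the example only if the question can be re-answered
    from its source and the new answer agrees with the stored one."""
    prompt = (
        f"Source:\n{ex.source}\nQuestion: {ex.question}\n"
        "Answer briefly, or reply UNANSWERABLE if the source is insufficient."
    )
    predicted = llm(prompt).strip()
    return "UNANSWERABLE" not in predicted.upper() and ex.answer.lower() in predicted.lower()

def build_dataset(llm: LLM, sources: List[str]) -> List[SyntheticExample]:
    """Generate one candidate per source, then discard low-quality (unanswerable) generations."""
    candidates = [generate_example(llm, s) for s in sources]
    return [ex for ex in candidates if is_answerable(llm, ex)]
```

The curated examples would then be used to fine-tune the target model, which is where the reported MHQA and TQA gains over the fine-tuned baselines come from.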
Related papers
- How do Large Language Models perform on the Question Answering tasks? [0.0]
Large Language Models (LLMs) have shown promising results on various NLP tasks using few-shot or zero-shot prompting, without being explicitly trained for those tasks.
We propose a comprehensive performance comparison between smaller fine-tuned models and out-of-the-box instruction-following LLMs on the Stanford Question Answering Dataset 2.0 (SQuAD2).
Our results show that smaller, fine-tuned models outperform current State-Of-The-Art (SOTA) LLMs on the fine-tuned task, but recent SOTA models are able to close this gap on the out...
arXiv Detail & Related papers (2024-12-17T13:19:38Z)
- AutoReason: Automatic Few-Shot Reasoning Decomposition [0.0]
Chain of Thought (CoT) was introduced in recent research as a method for improving step-by-step reasoning in Large Language Models.
We propose a system to automatically generate rationales using CoT.
Our method improves multi-step implicit reasoning capabilities by decomposing the implicit query into several explicit questions.
arXiv Detail & Related papers (2024-12-09T20:35:39Z)
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch [28.519536719973317]
ScaleQuest is a scalable and novel data synthesis method.
It generates questions from scratch without the need for seed data or complex augmentation constraints.
It can universally increase the performance of mainstream open-source models.
arXiv Detail & Related papers (2024-10-24T12:42:04Z)
- Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA [9.659820850719413]
We leverage Large Language Models (LLMs), which have been shown to have strong reasoning abilities, as automatic data annotators.
The key innovation in our method lies in the Synthesize Step-by-Step strategy.
We significantly enhance the chart VQA models, achieving the state-of-the-art accuracy on the ChartQA and PlotQA datasets.
arXiv Detail & Related papers (2024-03-25T03:02:27Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from the schema, imposes strong typing, and performs schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z)
- Learning to Perturb Word Embeddings for Out-of-distribution QA [55.103586220757464]
We propose a simple yet effective DA method based on a noise generator, which learns to perturb the word embedding of the input questions and context without changing their semantics.
We validate the performance of QA models trained with our word embedding perturbation on a single source dataset, evaluated on five different target domains.
Notably, the model trained with our method outperforms the model trained with more than 240K artificially generated QA pairs.
arXiv Detail & Related papers (2021-05-06T14:12:26Z)
- When in Doubt, Ask: Generating Answerable and Unanswerable Questions, Unsupervised [0.0]
Question Answering (QA) is key to enabling robust communication between humans and machines.
Modern language models used for QA have surpassed human performance in several essential tasks.
This paper studies augmenting human-made datasets with synthetic data as a way of reducing the need for costly human-annotated data.
arXiv Detail & Related papers (2020-10-04T15:56:44Z)
- Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA).
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA).
Second, we take advantage of the QA model to extract more appropriate answers, iteratively refining the data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.