Shape of synth to come: Why we should use synthetic data for English
surface realization
- URL: http://arxiv.org/abs/2005.02693v1
- Date: Wed, 6 May 2020 10:00:55 GMT
- Title: Shape of synth to come: Why we should use synthetic data for English
surface realization
- Authors: Henry Elder and Robert Burke and Alexander O'Connor and Jennifer
Foster
- Abstract summary: In the 2018 shared task there was very little difference in the absolute performance of systems trained with and without additional, synthetically created data.
We show, in experiments on the English 2018 dataset, that the use of synthetic data can have a substantial positive effect.
We argue that its use should be encouraged rather than prohibited so that future research efforts continue to explore systems that can take advantage of such data.
- Score: 72.62356061765976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Surface Realization Shared Tasks of 2018 and 2019 were Natural Language
Generation shared tasks with the goal of exploring approaches to surface
realization from Universal-Dependency-like trees to surface strings for several
languages. In the 2018 shared task there was very little difference in the
absolute performance of systems trained with and without additional,
synthetically created data, and a new rule prohibiting the use of synthetic
data was introduced for the 2019 shared task. Contrary to the findings of the
2018 shared task, we show, in experiments on the English 2018 dataset, that the
use of synthetic data can have a substantial positive effect - an improvement
of almost 8 BLEU points for a previously state-of-the-art system. We analyse
the effects of synthetic data, and we argue that its use should be encouraged
rather than prohibited so that future research efforts continue to explore
systems that can take advantage of such data.
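To make the task concrete, the sketch below illustrates the input/output contract of surface realization: an unordered, lemmatized, Universal-Dependency-like tree on the input side and an ordered, inflected sentence on the output side. The `Token` fields, the toy tree, and the deliberately naive baseline are illustrative assumptions made for this summary, not the authors' system or the official shared-task data format.

```python
# Minimal sketch of the input/output contract in surface realization.
# NOTE: the Token fields, the toy tree, and the naive baseline are
# illustrative assumptions, not the paper's system or the official
# shared-task data format.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One node of an unordered, Universal-Dependency-like input tree."""
    lemma: str            # lemmatized form, e.g. "chase" rather than "chased"
    upos: str             # universal part-of-speech tag
    head: Optional[int]   # list index of the governing token; None for the root
    deprel: str           # dependency relation to the head


def naive_linearize(tree: List[Token]) -> str:
    """Deliberately weak baseline: emit lemmas in their (arbitrary) input order.

    A real surface realizer must recover the word order and inflect the
    lemmas, e.g. turning the tree below into "The dog chased the cat".
    """
    return " ".join(tok.lemma for tok in tree)


# Unordered, lemmatized input; `head` indices refer to positions in this list.
example_tree = [
    Token("cat",   "NOUN", 1,    "obj"),    # object of "chase"
    Token("chase", "VERB", None, "root"),
    Token("dog",   "NOUN", 1,    "nsubj"),  # subject of "chase"
    Token("the",   "DET",  0,    "det"),    # determiner of "cat"
    Token("the",   "DET",  2,    "det"),    # determiner of "dog"
]

reference = "The dog chased the cat"

if __name__ == "__main__":
    print("naive output:", naive_linearize(example_tree))  # cat chase dog the the
    print("reference   :", reference)
```

In this setting, additional synthetic training pairs can in principle be obtained by running a dependency parser over extra raw sentences and pairing each resulting tree with its source sentence; the abstract does not spell out the authors' exact procedure, so treat that description as an assumption.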
Related papers
- Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning.
We find that models trained on synthetic data fall short of those trained on real data, but, surprisingly, the mismatch can be interpreted.
Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z)
- BERTtime Stories: Investigating the Role of Synthetic Story Data in Language pre-training [1.8817715864806608]
We study the effect of synthetic story data in language pre-training using TinyStories.
We train GPT-Neo models on subsets of TinyStories, while varying the amount of available data.
We find that, even with access to less than 100M words, the models are able to generate high-quality, original completions of a given story.
arXiv Detail & Related papers (2024-10-20T11:47:17Z)
- Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research [0.0]
Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics.
Access to these datasets is often restricted due to costs and platform regulations.
This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms.
arXiv Detail & Related papers (2024-07-11T09:12:39Z)
- JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching [18.94748873243611]
JobSkape is a framework to generate synthetic data for skill-to-taxonomy matching.
Within this framework, we create SkillSkape, a comprehensive open-source synthetic dataset of job postings.
We present a multi-step pipeline for skill extraction and matching tasks using large language models.
arXiv Detail & Related papers (2024-02-05T17:57:26Z)
- Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science [13.854807858791652]
We tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of real-world data researchers care about.
We study three strategies to increase the faithfulness of synthetic data: grounding, filtering, and taxonomy-based generation.
We conclude this paper with some recommendations on how to generate high(er)-fidelity synthetic data for specific tasks.
arXiv Detail & Related papers (2023-05-24T11:27:59Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-real Novel View Synthesis via Contrastive Learning [102.46382882098847]
We first investigate the effects of synthetic data in synthetic-to-real novel view synthesis.
We propose to introduce geometry-aware contrastive learning to learn multi-view consistent features with geometric constraints.
Our method can render images with higher quality and better fine-grained details, outperforming existing generalizable novel view synthesis methods in terms of PSNR, SSIM, and LPIPS.
arXiv Detail & Related papers (2023-03-20T12:06:14Z)
- Synthcity: facilitating innovative use cases of synthetic data in different data modalities [86.52703093858631]
Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation.
Synthcity provides practitioners with a single access point to cutting-edge research and tools in synthetic data.
arXiv Detail & Related papers (2023-01-18T14:49:54Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.