Related papers: Improving Text Relationship Modeling with Artificial Data

Improving Text Relationship Modeling with Artificial Data

URL: http://arxiv.org/abs/2010.14640v1
Date: Tue, 27 Oct 2020 22:04:54 GMT
Title: Improving Text Relationship Modeling with Artificial Data
Authors: Peter Organisciak, Maggie Ryan
Abstract summary: We apply and evaluate a synthetic data approach to relationship classification in digital libraries. We find that for classification on whole-part relationships between books, synthetic data improves a deep neural network classifier by 91%.
Score: 0.07614628596146598
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data augmentation uses artificially-created examples to support supervised machine learning, adding robustness to the resulting models and helping to account for limited availability of labelled data. We apply and evaluate a synthetic data approach to relationship classification in digital libraries, generating artificial books with relationships that are common in digital libraries but not easier inferred from existing metadata. We find that for classification on whole-part relationships between books, synthetic data improves a deep neural network classifier by 91%. Further, we consider the ability of synthetic data to learn a useful new text relationship class from fully artificial training data.

Related papers

Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation. LLM-TabFlow is a novel approach that captures complex inter-column relationships and compress data, while using Score-based Diffusion to model the distribution of the compressed data in latent space. Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis [0.0]
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic. Our approach transforms numerical data into text, re-framing data generation as a language modeling task. Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
arXiv Detail & Related papers (2024-11-04T09:51:10Z)
Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z)
From Measurement Instruments to Data: Leveraging Theory-Driven Synthetic Training Data for Classifying Social Constructs [2.0591508284285376]
We examine the potential of theory-driven synthetic training data for improving the measurement of social constructs. We show that synthetic data can be highly effective in reducing the need for labeled data in political topic classification.
arXiv Detail & Related papers (2024-10-16T14:42:23Z)
Synthetic continued pretraining [29.6872772403251]
We propose synthetic continued pretraining on a small corpus of domain-specific documents. We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm. We show how synthetic data augmentation can "rearrange" knowledge to enable more data-efficient learning.
arXiv Detail & Related papers (2024-09-11T17:21:59Z)
Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs) Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws. Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases. The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z)
Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance. Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models. In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
Foundations of Bayesian Learning from Synthetic Data [1.6249267147413522]
We use a Bayesian paradigm to characterise the updating of model parameters when learning on synthetic data. Recent results from general Bayesian updating support a novel and robust approach to synthetic-learning founded on decision theory.
arXiv Detail & Related papers (2020-11-16T21:49:17Z)
Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations. Our framework well preserves the relations between samples. By seeking to embed samples into subspace, we show that our method can address the large-scale and out-of-sample problem.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.