Generating Synthetic Text Data to Evaluate Causal Inference Methods
- URL: http://arxiv.org/abs/2102.05638v1
- Date: Wed, 10 Feb 2021 18:53:11 GMT
- Title: Generating Synthetic Text Data to Evaluate Causal Inference Methods
- Authors: Zach Wood-Doughty, Ilya Shpitser, Mark Dredze
- Abstract summary: We develop a framework for adapting existing generation models to produce synthetic text datasets with known causal effects.
We use this framework to perform an empirical comparison of four recently-proposed methods for estimating causal effects from text data.
- Score: 23.330942019150786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Drawing causal conclusions from observational data requires making
assumptions about the true data-generating process. Causal inference research
typically considers low-dimensional data, such as categorical or numerical
fields in structured medical records. High-dimensional and unstructured data
such as natural language complicates the evaluation of causal inference
methods; such evaluations rely on synthetic datasets with known causal effects.
Models for natural language generation have been widely studied and perform
well empirically. However, existing methods are not immediately applicable to
producing synthetic datasets for causal evaluations, as they do not allow for
quantifying a causal effect on the text itself. In this work, we develop a
framework for adapting existing generation models to produce synthetic text
datasets with known causal effects. We use this framework to perform an
empirical comparison of four recently-proposed methods for estimating causal
effects from text data. We release our code and synthetic datasets.
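As a toy illustration of the idea (not the authors' actual framework), the following sketch draws bag-of-words documents from a fully specified process in which a binary treatment shifts the word distribution, so the causal effect on the text is known by construction; the vocabulary and parameters are invented:
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: some words are sensitive to treatment, others neutral.
TREATED_WORDS = ["improved", "recovered", "stable"]
CONTROL_WORDS = ["worsened", "relapsed", "critical"]
NEUTRAL_WORDS = ["patient", "visit", "followup", "noted"]

def generate_document(treatment: int, effect: float = 0.8, length: int = 10) -> str:
    """Sample a bag-of-words document whose word distribution depends on the
    binary treatment; `effect` controls the known causal effect on the text."""
    words = []
    for _ in range(length):
        if rng.random() < effect:
            pool = TREATED_WORDS if treatment == 1 else CONTROL_WORDS
        else:
            pool = NEUTRAL_WORDS
        words.append(str(rng.choice(pool)))
    return " ".join(words)

# Because the generating process is fully specified, the effect of treatment
# on any text statistic is known by construction and can serve as ground
# truth when benchmarking causal estimators that consume text.
dataset = [(int(t), generate_document(int(t))) for t in rng.integers(0, 2, 1000)]
```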
Related papers
- Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
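As a rough sketch of the rule-based half of such an approach (the rules and sentence here are invented for illustration; the model-based half would replace `corrupt` with an LLM call):
```python
import random

random.seed(0)

# Invented substitution rules that inject common grammatical errors.
RULES = [("their", "there"), ("has", "have"), ("an", "a")]

def corrupt(sentence: str, p: float = 0.5) -> str:
    """Apply each rule with probability p to create an errorful source side."""
    out = []
    for tok in sentence.split():
        for correct, wrong in RULES:
            if tok == correct and random.random() < p:
                tok = wrong
                break
        out.append(tok)
    return " ".join(out)

clean = "She has taken their advice on an issue."
pair = (corrupt(clean), clean)  # (errorful source, clean correction target)
print(pair)
```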
arXiv Detail & Related papers (2024-06-25T10:49:56Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to the scarcity of high-quality data for training large language models (LLMs).
Our work delves into the specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate them.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
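One common form of unlearning, shown below as a minimal sketch, is gradient ascent on the flawed examples; this assumes a Hugging Face-style causal LM whose forward pass returns a `.loss`, and is not necessarily the paper's exact method:
```python
def unlearn_step(model, tokenizer, flawed_text: str, optimizer) -> float:
    """One gradient-ascent step on a flawed synthetic Q-A pair, nudging the
    model away from patterns absorbed from it. Assumes a Hugging Face-style
    causal LM whose forward pass returns a `.loss` when given `labels`."""
    batch = tokenizer(flawed_text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])
    loss = -out.loss          # negated LM loss: ascend instead of descend
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```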
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data [40.165159490379146]
We show that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased.
Despite the use of a previously proposed correction factor, this problem persists for deep generative models.
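A minimal simulation of this phenomenon (illustrative, not the paper's setup): fit a Gaussian "generator" to real data under a true null, then run a naive test that treats the synthetic rows as fresh observations; the extra variance injected by the fitted generator inflates the rejection rate well above the nominal 5%:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_real, n_synth, reps = 200, 200, 2000
rejections = 0

for _ in range(reps):
    real = rng.normal(0.0, 1.0, n_real)              # H0 is true: mean is 0
    # "Generator": fit a Gaussian to the real sample, then sample from it.
    synth = rng.normal(real.mean(), real.std(ddof=1), n_synth)
    # Naive analysis treats the synthetic rows as fresh i.i.d. observations.
    _, p = stats.ttest_1samp(synth, 0.0)
    rejections += p < 0.05

print(f"empirical type 1 error: {rejections / reps:.3f}")  # far above 0.05
```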
arXiv Detail & Related papers (2023-12-13T02:04:41Z)
- Large Language Models are Few-Shot Training Example Generators: A Case Study in Fallacy Recognition [49.38757847011105]
Computational fallacy recognition faces challenges due to the diverse genres, domains, and types of fallacies found in datasets.
We aim to enhance existing models for fallacy recognition by incorporating additional context and by leveraging large language models to generate synthetic data.
Our evaluation results demonstrate consistent improvements across fallacy types, datasets, and generators.
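A sketch of the few-shot generation idea (the prompt, example, and `generate` callable are all hypothetical stand-ins for an actual LLM API):
```python
# `generate` stands in for any LLM completion call; prompt text is invented.
FEW_SHOT = """Generate a short argument containing the named fallacy.

Fallacy: ad hominem
Argument: You can't trust his climate data; he failed chemistry in college.

Fallacy: {fallacy}
Argument:"""

def make_synthetic_examples(generate, fallacy: str, n: int = 5):
    """Return n (argument, label) pairs produced by the few-shot prompt."""
    prompt = FEW_SHOT.format(fallacy=fallacy)
    return [(generate(prompt), fallacy) for _ in range(n)]
```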
arXiv Detail & Related papers (2023-11-16T04:17:47Z)
- Boosting Data Analytics With Synthetic Volume Expansion [3.568650932986342]
This article explores the effectiveness of statistical methods on synthetic data and the privacy risks that synthetic data poses.
A key finding within this framework is the generational effect: the error rate of statistical methods on synthetic data initially decreases as more synthetic data is added, but may eventually rise or stabilize.
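A toy illustration of the generational effect under a simple Gaussian generator (not the article's experiments): estimation error first falls as synthetic volume grows, then flattens at the bias of the fitted generator:
```python
import numpy as np

rng = np.random.default_rng(2)
real = rng.normal(0.0, 1.0, 100)           # small real sample, true mean 0
mu_hat, sd_hat = real.mean(), real.std(ddof=1)

for m in [100, 1_000, 10_000, 100_000]:
    synth = rng.normal(mu_hat, sd_hat, m)  # ever-larger synthetic samples
    print(f"m={m:>7}: |error| = {abs(synth.mean()):.4f}")
# The error first shrinks with m, then flattens near |mu_hat| -- the bias
# of the fitted generator, which no amount of synthetic data removes.
```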
arXiv Detail & Related papers (2023-10-27T01:57:27Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
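One data-centric technique in this spirit, sketched below (illustrative; not necessarily among the paper's benchmarked methods), scores synthetic rows with a real-vs-synthetic classifier and keeps only the most realistic ones:
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def filter_synthetic(real: np.ndarray, synth: np.ndarray, keep: float = 0.5):
    """Keep the `keep` fraction of synthetic rows that a real-vs-synthetic
    classifier finds most realistic."""
    X = np.vstack([real, synth])
    y = np.r_[np.ones(len(real)), np.zeros(len(synth))]
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    realness = clf.predict_proba(synth)[:, 1]     # P(row looks real)
    cutoff = np.quantile(realness, 1.0 - keep)
    return synth[realness >= cutoff]
```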
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Boosting Synthetic Data Generation with Effective Nonlinear Causal Discovery [11.81479419498206]
Generating plausible data samples is crucial in software testing, data privacy, imbalanced learning, and AI explainability.
A common assumption of widely used data generation approaches is that the features are independent.
We propose a synthetic dataset generator that can discover nonlinear causalities among the variables and use them at generation time.
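A minimal sketch of generation from nonlinear structural equations along a causal order (the graph X -> Y -> Z and the functional forms are invented for illustration):
```python
import numpy as np

rng = np.random.default_rng(3)

def generate(n: int) -> np.ndarray:
    """Sample along the causal order X -> Y -> Z with nonlinear mechanisms,
    rather than sampling each feature independently."""
    x = rng.normal(0, 1, n)
    y = np.tanh(2.0 * x) + 0.3 * rng.normal(0, 1, n)   # Y depends on X
    z = y**2 + 0.5 * x + 0.3 * rng.normal(0, 1, n)     # Z depends on X and Y
    return np.column_stack([x, y, z])

samples = generate(1000)  # preserves nonlinear dependencies among features
```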
arXiv Detail & Related papers (2023-01-18T10:54:06Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
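A bare-bones sketch of one such sampling strategy (the triples, template, and uniform sampling are all illustrative assumptions, not the paper's setup):
```python
import random

random.seed(0)

# Illustrative triples; a real setup would sample from a large KG.
TRIPLES = [
    ("aspirin", "treats", "headache"),
    ("paris", "capital_of", "france"),
    ("oak", "is_a", "tree"),
]

def verbalize(head: str, rel: str, tail: str) -> str:
    return f"{head} {rel.replace('_', ' ')} {tail}."

def sample_synthetic(k: int) -> list:
    """Uniform sampling; other strategies might weight relations or entities."""
    return [verbalize(*random.choice(TRIPLES)) for _ in range(k)]

print(sample_synthetic(2))
```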
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
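A rough two-stage sketch in this spirit (not the paper's exact estimator): fit a possibly confounded CATE model on the observational data, then learn an additive bias correction on the small randomized sample, here assuming numpy inputs and a known 0.5 treatment probability in the trial:
```python
from sklearn.ensemble import GradientBoostingRegressor

def fit_combined(X_obs, t_obs, y_obs, X_rct, t_rct, y_rct):
    """Stage 1: T-learner CATE on observational data (possibly confounded).
    Stage 2: learn an additive bias correction on the unconfounded RCT,
    assuming a known treatment probability of 0.5 in the trial."""
    mu1 = GradientBoostingRegressor().fit(X_obs[t_obs == 1], y_obs[t_obs == 1])
    mu0 = GradientBoostingRegressor().fit(X_obs[t_obs == 0], y_obs[t_obs == 0])
    tau_obs = lambda X: mu1.predict(X) - mu0.predict(X)
    # IPW pseudo-outcome with E[pseudo | X] = CATE(X) when P(T=1) = 0.5.
    pseudo = y_rct * (2 * t_rct - 1) / 0.5
    bias = GradientBoostingRegressor().fit(X_rct, pseudo - tau_obs(X_rct))
    return lambda X: tau_obs(X) + bias.predict(X)
```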
arXiv Detail & Related papers (2022-02-25T18:59:54Z)
- Evaluating Causal Inference Methods [0.4588028371034407]
We introduce a deep generative model-based framework, Credence, to validate causal inference methods.
arXiv Detail & Related papers (2022-02-09T00:21:22Z)
- Causal-TGAN: Generating Tabular Data Using Causal Generative Adversarial Networks [7.232789848964222]
We propose a causal model named Causal Tabular Generative Adversarial Network (Causal-TGAN) to generate synthetic data.
Experiments on both simulated and real datasets demonstrate the superior performance of our method.
arXiv Detail & Related papers (2021-04-21T17:59:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.