Improve Fidelity and Utility of Synthetic Credit Card Transaction Time
Series from Data-centric Perspective
- URL: http://arxiv.org/abs/2401.00965v1
- Date: Mon, 1 Jan 2024 22:34:14 GMT
- Title: Improve Fidelity and Utility of Synthetic Credit Card Transaction Time
Series from Data-centric Perspective
- Authors: Din-Yin Hsieh, Chi-Hua Wang, Guang Cheng
- Abstract summary: We focus on attaining both high fidelity to actual data and optimal utility for machine learning tasks.
We introduce five pre-processing schemas to enhance the training of the Conditional Probabilistic Auto-Regressive Model.
Our attention shifts to training fraud detection models tailored for time-series data, evaluating the utility of the synthetic data.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploring generative model training for synthetic tabular data, specifically
in sequential contexts such as credit card transaction data, presents
significant challenges. This paper addresses these challenges, focusing on
attaining both high fidelity to actual data and optimal utility for machine
learning tasks. We introduce five pre-processing schemas to enhance the
training of the Conditional Probabilistic Auto-Regressive Model (CPAR),
demonstrating incremental improvements in the synthetic data's fidelity and
utility. Upon achieving satisfactory fidelity levels, our attention shifts to
training fraud detection models tailored for time-series data, evaluating the
utility of the synthetic data. Our findings offer valuable insights and
practical guidelines for synthetic data practitioners in the finance sector,
transitioning from real to synthetic datasets for training purposes, and
illuminating broader methodologies for synthesizing credit card transaction
time series.
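The paper does not spell out its five pre-processing schemas in this abstract, but a minimal sketch of the kind of data-centric transformation it describes, assuming hypothetical column names (`card_id`, `timestamp`, `amount`), would convert absolute timestamps into per-card inter-arrival gaps and log-scale the heavy-tailed amounts before feeding sequences to a sequential synthesizer such as CPAR:

```python
import numpy as np
import pandas as pd

def preprocess_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """One hypothetical pre-processing schema for transaction sequences:
    per-card inter-arrival times plus log-scaled amounts, transformations
    commonly used to stabilize the training of sequential tabular models."""
    df = df.sort_values(["card_id", "timestamp"]).copy()
    # Replace absolute timestamps with per-card inter-arrival gaps in
    # seconds; the first transaction of each card gets a gap of 0.
    df["delta_t"] = (
        df.groupby("card_id")["timestamp"].diff().dt.total_seconds().fillna(0.0)
    )
    # Log-scale heavy-tailed transaction amounts.
    df["log_amount"] = np.log1p(df["amount"])
    return df.drop(columns=["timestamp", "amount"])

# Toy example with two cards.
raw = pd.DataFrame({
    "card_id": ["a", "a", "b"],
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00:00", "2024-01-01 00:10:00", "2024-01-01 01:00:00",
    ]),
    "amount": [10.0, 100.0, 5.0],
})
processed = preprocess_transactions(raw)
```

This is an illustration of one plausible schema, not the authors' actual pipeline; the paper evaluates five such schemas and measures their incremental effect on fidelity and downstream fraud-detection utility.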
Related papers
- Little Giants: Synthesizing High-Quality Embedding Data at Scale
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls, yet outperforms the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks
This paper introduces Knowledge Recycling (KR), a pipeline designed to optimise the generation and use of synthetic data for training downstream classifiers.
At the heart of this pipeline is Generative Knowledge Distillation (GKD), the proposed technique that significantly improves the quality and usefulness of the information.
The results show a significant reduction in the performance gap between models trained on real and synthetic data, with models based on synthetic data outperforming those trained on real data in some cases.
arXiv Detail & Related papers (2024-07-22T10:31:07Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models
Synthetic data has been proposed as a solution to the scarcity of high-quality data for training large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Enhancing Indoor Temperature Forecasting through Synthetic Data in Low-Data Environments
We investigate the efficacy of data augmentation techniques that leverage state-of-the-art AI-based methods for synthetic data generation.
Inspired by practical and experimental motivations, we explore fusion strategies of real and synthetic data to improve forecasting models.
arXiv Detail & Related papers (2024-06-07T12:36:31Z)
- Best Practices and Lessons Learned on Synthetic Data
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark
Synthetic data serves as an alternative for training machine learning models. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Does Synthetic Data Make Large Language Models More Efficient?
This paper explores the nuances of synthetic data generation in NLP.
We highlight its advantages, including data augmentation potential and the introduction of structured variety.
We demonstrate the impact of template-based synthetic data on the performance of modern transformer models.
arXiv Detail & Related papers (2023-10-11T19:16:09Z)
- A Study on Improving Realism of Synthetic Data for Machine Learning
This work trains and evaluates a synthetic-to-real generative model that transforms synthetic renderings into more realistic styles on general-purpose datasets, conditioned on unlabeled real-world data.
arXiv Detail & Related papers (2023-04-24T21:41:54Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- CAFE: Learning to Condense Dataset by Aligning Features
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.