Effects of Using Synthetic Data on Deep Recommender Models' Performance
- URL: http://arxiv.org/abs/2406.18286v1
- Date: Wed, 26 Jun 2024 12:14:10 GMT
- Title: Effects of Using Synthetic Data on Deep Recommender Models' Performance
- Authors: Fatih Cihan Taskin, Ilknur Akcay, Muhammed Pesen, Said Aldemir, Ipek Iraz Esin, Furkan Durmus,
- Abstract summary: This study investigates the effectiveness of synthetic data generation in addressing data imbalances within recommender systems.
Our results show that the inclusion of generated negative samples consistently improves the Area Under the Curve (AUC) scores.
- Score: 0.0
- License:
- Abstract: Recommender systems are essential for enhancing user experiences by suggesting items based on individual preferences. However, these systems frequently face the challenge of data imbalance, characterized by a predominance of negative interactions over positive ones. This imbalance can result in biased recommendations favoring popular items. This study investigates the effectiveness of synthetic data generation in addressing data imbalances within recommender systems. Six different methods were used to generate synthetic data. Our experimental approach involved generating synthetic data using these methods and integrating the generated samples into the original dataset. Our results show that the inclusion of generated negative samples consistently improves the Area Under the Curve (AUC) scores. The significant impact of synthetic negative samples highlights the potential of data augmentation strategies to address issues of data sparsity and imbalance, ultimately leading to improved performance of recommender systems.
Related papers
- On the Diversity of Synthetic Data and its Impact on Training Large Language Models [34.00031258223175]
Large Language Models (LLMs) have accentuated the need for diverse, high-quality pre-training data.
Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility.
We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-10-19T22:14:07Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs)
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Synthetic Information towards Maximum Posterior Ratio for deep learning
on Imbalanced Data [1.7495515703051119]
We propose a technique for data balancing by generating synthetic data for the minority class.
Our method prioritizes balancing the informative regions by identifying high entropy samples.
Our experimental results on forty-one datasets demonstrate the superior performance of our technique.
arXiv Detail & Related papers (2024-01-05T01:08:26Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Synthetic Alone: Exploring the Dark Side of Synthetic Data for
Grammatical Error Correction [5.586798679167892]
Data-centric AI approach aims to enhance the model performance without modifying the model.
Data quality control method has a positive impact on models trained with real-world data.
A negative impact is observed in models trained solely on synthetic data.
arXiv Detail & Related papers (2023-06-26T01:40:28Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Evaluation of the Synthetic Electronic Health Records [3.255030588361125]
This work outlines two metrics called Similarity and Uniqueness for sample-wise assessment of synthetic datasets.
We demonstrate the proposed notions with several state-of-the-art generative models to synthesise Cystic Fibrosis (CF) patients' electronic health records.
arXiv Detail & Related papers (2022-10-16T22:46:08Z) - Investigating Bias with a Synthetic Data Generator: Empirical Evidence
and Philosophical Interpretation [66.64736150040093]
Machine learning applications are becoming increasingly pervasive in our society.
Risk is that they will systematically spread the bias embedded in data.
We propose to analyze biases by introducing a framework for generating synthetic data with specific types of bias and their combinations.
arXiv Detail & Related papers (2022-09-13T11:18:50Z) - Generating Negative Samples for Sequential Recommendation [83.60655196391855]
We propose to Generate Negative Samples (items) for Sequential Recommendation (SR)
A negative item is sampled at each time step based on the current SR model's learned user preferences toward items.
Experiments on four public datasets verify the importance of providing high-quality negative samples for SR.
arXiv Detail & Related papers (2022-08-07T05:44:13Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.