Related papers: Creating Synthetic Datasets for Collaborative Filtering Recommender Systems using Generative Adversarial Networks

URL: http://arxiv.org/abs/2303.01297v1
Date: Thu, 2 Mar 2023 14:23:27 GMT
Title: Creating Synthetic Datasets for Collaborative Filtering Recommender Systems using Generative Adversarial Networks
Authors: Jesús Bobadilla, Abraham Gutiérrez, Raciel Yera, and Luis Martínez
Abstract summary: Research and education in machine learning needs diverse, representative, and open datasets to handle the necessary training, validation, and testing tasks. To feed this research variety, it is necessary and convenient to reinforce the existing datasets with synthetic ones. This paper proposes a Generative Adversarial Network (GAN)-based method to generate collaborative filtering datasets.
Score: 1.290382979353427
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Research and education in machine learning needs diverse, representative, and open datasets that contain sufficient samples to handle the necessary training, validation, and testing tasks. Currently, the Recommender Systems area includes a large number of subfields in which accuracy and beyond accuracy quality measures are continuously improved. To feed this research variety, it is necessary and convenient to reinforce the existing datasets with synthetic ones. This paper proposes a Generative Adversarial Network (GAN)-based method to generate collaborative filtering datasets in a parameterized way, by selecting their preferred number of users, items, samples, and stochastic variability. This parameterization cannot be made using regular GANs. Our GAN model is fed with dense, short, and continuous embedding representations of items and users, instead of sparse, large, and discrete vectors, to make an accurate and quick learning, compared to the traditional approach based on large and sparse input vectors. The proposed architecture includes a DeepMF model to extract the dense user and item embeddings, as well as a clustering process to convert from the dense GAN generated samples to the discrete and sparse ones, necessary to create each required synthetic dataset. The results of three different source datasets show adequate distributions and expected quality values and evolutions on the generated datasets compared to the source ones. Synthetic datasets and source codes are available to researchers.
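
The final step described in the abstract, converting dense GAN-generated samples into discrete (user, item, rating) records, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each generated row is a user embedding, an item embedding, and a rating concatenated together, and it assumes plain k-means as the clustering method, with the number of clusters set to the requested number of users and items. The names `kmeans`, `discretize`, and `fake_samples`, and the choice to average colliding ratings, are all hypothetical.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for every row of `points`."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (fancy indexing makes a copy).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center: shape (n, k).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels


def discretize(samples, emb_dim, n_users, n_items):
    """Map dense rows (user_emb | item_emb | rating) to (user_id, item_id, rating)."""
    # Assumption: one cluster per synthetic user and one per synthetic item.
    user_ids = kmeans(samples[:, :emb_dim], n_users, seed=1)
    item_ids = kmeans(samples[:, emb_dim:2 * emb_dim], n_items, seed=2)
    ratings = samples[:, -1]
    # Assumption: several dense rows landing on the same (user, item) pair
    # are merged by averaging their ratings.
    merged = {}
    for u, i, r in zip(user_ids, item_ids, ratings):
        merged.setdefault((int(u), int(i)), []).append(float(r))
    return [(u, i, sum(rs) / len(rs)) for (u, i), rs in sorted(merged.items())]


# Random rows stand in for GAN output here; a trained generator would supply them.
rng = np.random.default_rng(42)
fake_samples = np.hstack([
    rng.normal(size=(500, 2 * 4)),       # 4-dim user and item embeddings
    rng.uniform(1, 5, size=(500, 1)),    # ratings in [1, 5)
])
dataset = discretize(fake_samples, emb_dim=4, n_users=20, n_items=30)
print(len(dataset), dataset[:3])
```

In this sketch the requested numbers of users and items are simply the two cluster counts, which is one way the parameterization mentioned in the abstract could be realized.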

Related papers

Private Training & Data Generation by Clustering Embeddings [74.00687214400021]
Differential privacy (DP) provides a robust framework for protecting individual data. We introduce a novel principled method for DP synthetic image embedding generation. Empirically, a simple two-layer neural network trained on synthetically generated embeddings achieves state-of-the-art (SOTA) classification accuracy.
arXiv Detail & Related papers (2025-06-20T00:17:14Z)
Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable. We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
Testing Deep Learning Recommender Systems Models on Synthetic GAN-Generated Datasets [0.27624021966289597]
The published method Generative Adversarial Networks for Recommender Systems (GANRS) allows generating data sets for collaborative filtering recommendation systems. We have tested the GANRS method by creating multiple synthetic datasets from three different real datasets taken as a source. We have also selected six state-of-the-art collaborative filtering deep learning models to test both their comparative performance and the GANRS method.
arXiv Detail & Related papers (2024-10-23T08:09:48Z)
Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning. We construct pseudo-skill clusters by grouping gradient-based sample vectors. We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
Quality-Diversity Generative Sampling for Learning with Synthetic Data [18.642540152362237]
Generative models can serve as surrogates for some real data sources by creating synthetic training datasets. We propose quality-diversity generative sampling (QDGS), a framework for sampling data uniformly across a user-defined measure space.
arXiv Detail & Related papers (2023-12-22T01:43:27Z)
A Configurable Library for Generating and Manipulating Maze Datasets [0.9268994664916388]
Mazes serve as an excellent testbed due to varied generation algorithms. We present $\texttt{maze-dataset}$, a comprehensive library for generating, processing, and visualizing datasets consisting of maze-solving tasks.
arXiv Detail & Related papers (2023-09-19T10:20:11Z)
TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations. We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models. In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
Distributed Traffic Synthesis and Classification in Edge Networks: A Federated Self-supervised Learning Approach [83.2160310392168]
This paper proposes FS-GAN to support automatic traffic analysis and synthesis over a large number of heterogeneous datasets. FS-GAN is composed of multiple distributed Generative Adversarial Networks (GANs). FS-GAN can classify data of unknown types of service and create synthetic samples that capture the traffic distribution of the unknown types.
arXiv Detail & Related papers (2023-02-01T03:23:11Z)
Deep Variational Models for Collaborative Filtering-based Recommender Systems [63.995130144110156]
Deep learning provides accurate collaborative filtering models to improve recommender system results. Our proposed models apply the variational concept to inject stochasticity into the latent space of the deep architecture. Results show the superiority of the proposed approach in scenarios where the variational enrichment exceeds the injected noise effect.
arXiv Detail & Related papers (2021-07-27T08:59:39Z)
Differential-Critic GAN: Generating What You Want by a Cue of Preferences [34.25181656518662]
We propose Differential-Critic Generative Adversarial Network (DiCGAN) to learn the distribution of user-desired data. DiCGAN generates desired data that meets the user's expectations and can assist in designing biological products with desired properties.
arXiv Detail & Related papers (2021-07-14T13:44:07Z)
Data Augmentation for Abstractive Query-Focused Multi-Document Summarization [129.96147867496205]
We present two QMDS training datasets, which we construct using two data augmentation methods. These two datasets have complementary properties, i.e., QMDSCNN has real summaries but queries are simulated, while QMDSIR has real queries but simulated summaries. We build end-to-end neural network models on the combined datasets that yield new state-of-the-art transfer results on DUC datasets.
arXiv Detail & Related papers (2021-03-02T16:57:01Z)
SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources [8.350531869939351]
We study a synthetic data generation task called downscaling. We propose a multi-stage framework called SYNC (Synthetic Data Generation via Gaussian Copula). We make four key contributions in this work.
arXiv Detail & Related papers (2020-09-20T16:36:25Z)
Lessons Learned from the Training of GANs on Artificial Datasets [0.0]
Generative Adversarial Networks (GANs) have made great progress in synthesizing realistic images in recent years. GANs are prone to underfitting or overfitting, making the analysis of them difficult and constrained. We train them on artificial datasets where there are infinitely many samples and the real data distributions are simple. We find that training mixtures of GANs leads to more performance gain compared to increasing the network depth or width.
arXiv Detail & Related papers (2020-07-13T14:51:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.