Sequential Data Augmentation for Generative Recommendation
- URL: http://arxiv.org/abs/2509.13648v1
- Date: Wed, 17 Sep 2025 02:53:25 GMT
- Title: Sequential Data Augmentation for Generative Recommendation
- Authors: Geon Lee, Bhuvesh Kumar, Clark Mingxuan Ju, Tong Zhao, Kijung Shin, Neil Shah, Liam Collins
- Abstract summary: Generative recommendation plays a crucial role in personalized systems, predicting users' future interactions from their historical behavior sequences. Data augmentation, the process of constructing training data from user interaction histories, is a critical yet underexplored factor in training these models. We propose GenPAS, a principled framework that models augmentation as a sampling process and enables flexible control of the resulting training distribution. Our experiments on benchmark and industrial datasets demonstrate that GenPAS yields superior accuracy, data efficiency, and parameter efficiency compared to existing strategies.
- Score: 54.765568804267645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative recommendation plays a crucial role in personalized systems, predicting users' future interactions from their historical behavior sequences. A critical yet underexplored factor in training these models is data augmentation, the process of constructing training data from user interaction histories. By shaping the training distribution, data augmentation directly and often substantially affects model generalization and performance. Nevertheless, in much of the existing work, this process is simplified, applied inconsistently, or treated as a minor design choice, without a systematic and principled understanding of its effects. Motivated by our empirical finding that different augmentation strategies can yield large performance disparities, we conduct an in-depth analysis of how they reshape training distributions and influence alignment with future targets and generalization to unseen inputs. To systematize this design space, we propose GenPAS, a generalized and principled framework that models augmentation as a stochastic sampling process over input-target pairs with three bias-controlled steps: sequence sampling, target sampling, and input sampling. This formulation unifies widely used strategies as special cases and enables flexible control of the resulting training distribution. Our extensive experiments on benchmark and industrial datasets demonstrate that GenPAS yields superior accuracy, data efficiency, and parameter efficiency compared to existing strategies, providing practical guidance for principled training data construction in generative recommendation.
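The three-step sampling process described in the abstract (sequence sampling, target sampling, input sampling) can be illustrated with a minimal Python sketch. The abstract does not specify how each step is biased, so the power-law weights below and the names `sample_training_pair`, `alpha`, `beta`, and `gamma` are assumptions made purely for illustration, not the authors' implementation.

```python
import random


def sample_training_pair(sequences, alpha=0.0, beta=0.0, gamma=0.0, rng=None):
    """Draw one (input, target) pair via three biased sampling steps.

    Hypothetical sketch: the power-law weighting and the exponents
    alpha/beta/gamma are illustrative assumptions, not taken from the paper.
    """
    rng = rng or random.Random()

    # A sequence needs at least two items to yield an input and a target.
    usable = [s for s in sequences if len(s) >= 2]
    if not usable:
        raise ValueError("need at least one sequence with two or more items")

    # Step 1: sequence sampling. Weight each sequence by its number of
    # candidate targets raised to alpha (alpha=0 means uniform over sequences).
    seq_weights = [float(len(s) - 1) ** alpha for s in usable]
    seq = rng.choices(usable, weights=seq_weights, k=1)[0]

    # Step 2: target sampling. Candidate target positions are 1..len(seq)-1;
    # a larger beta favours later (more recent) positions.
    positions = list(range(1, len(seq)))
    pos_weights = [float(p) ** beta for p in positions]
    t = rng.choices(positions, weights=pos_weights, k=1)[0]

    # Step 3: input sampling. The input is the contiguous window seq[i:t]
    # ending right before the target; a larger gamma favours longer windows.
    starts = list(range(t))
    start_weights = [float(t - i) ** gamma for i in starts]
    i = rng.choices(starts, weights=start_weights, k=1)[0]

    return seq[i:t], seq[t]


if __name__ == "__main__":
    histories = [[1, 2, 3, 4, 5], [10, 11], [7]]
    print(sample_training_pair(histories, alpha=1.0, beta=2.0, gamma=1.0,
                               rng=random.Random(0)))
```

In this sketch, setting all three exponents to 0 makes every step uniform, while a large `beta` concentrates the sampled targets on the last item of each sequence and a large `gamma` concentrates the inputs on the full preceding prefix. This is one way a single parameterised sampler can mimic several fixed augmentation strategies.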
Related papers
- PEER pressure: Model-to-Model Regularization for Single Source Domain Generalization [12.15086255236961]
We show that the performance of such augmentation-based methods in the target domains universally fluctuates during training. We propose a novel generalization method, coined Parameter-Space Ensemble with Entropy Regularization (PEER), that uses a proxy model to learn the augmented data.
arXiv Detail & Related papers (2025-05-19T06:01:11Z)
- Predicting Practically? Domain Generalization for Predictive Analytics in Real-world Environments [18.086130222010496]
We propose a novel domain generalization method tailored to handle complex distribution shifts. Our method builds upon the Distributionally Robust Optimization framework, optimizing model performance over a set of hypothetical worst-case distributions. We discuss the broader implications of our method for advancing Information Systems (IS) design research.
arXiv Detail & Related papers (2025-03-05T11:21:37Z)
- Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance. We introduce novel algorithms for dynamic, instance-level data reweighting. Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
- Pre-trained Recommender Systems: A Causal Debiasing Perspective [19.712997823535066]
We develop a generic recommender that captures universal interaction patterns by training on generic user-item interaction data extracted from different domains.
Our empirical studies show that the proposed model could significantly improve the recommendation performance in zero- and few-shot learning settings.
arXiv Detail & Related papers (2023-10-30T03:37:32Z)
- An Empirical Study on Distribution Shift Robustness From the Perspective of Pre-Training and Data Augmentation [91.62129090006745]
This paper studies the distribution shift problem from the perspective of pre-training and data augmentation.
We provide the first comprehensive empirical study focusing on pre-training and data augmentation.
arXiv Detail & Related papers (2022-05-25T13:04:53Z)
- Data Augmentation Strategies for Improving Sequential Recommender Systems [7.986899327513767]
Sequential recommender systems have recently achieved significant performance improvements with the exploitation of deep learning (DL) based methods.
We propose a set of data augmentation strategies, all of which transform original item sequences in the way of direct corruption.
Experiments on the latest DL-based model show that applying data augmentation can help the model generalize better.
arXiv Detail & Related papers (2022-03-26T09:58:14Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch achieves final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
- Generative Data Augmentation for Commonsense Reasoning [75.26876609249197]
G-DAUGC is a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting.
G-DAUGC consistently outperforms existing data augmentation methods based on back-translation.
Our analysis demonstrates that G-DAUGC produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
arXiv Detail & Related papers (2020-04-24T06:12:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.