Cold-Starts in Generative Recommendation: A Reproducibility Study
Abstract Overview
This paper presents a systematic reproducibility study of generative recommendation under unified cold-start protocols, covering both new-user and new-item settings. The authors reproduce representative generative recommenders alongside traditional sequential baselines (SASRec, GRU4Rec) and evaluate them on three datasets: Amazon-Toys, MicroLens, and Steam. The study isolates three design dimensions that are often entangled in prior work: model scale, item identifier design, and training strategy (SFT vs. reinforcement learning). Experiments reveal that cold-start behavior is highly asymmetric, with item cold-start causing dramatic performance drops while user cold-start leads to only moderate degradation. The work provides a controlled benchmarking framework and actionable guidance for improving generalization in generative recommendation systems under cold-start conditions.
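The new-user and new-item settings can be made concrete with a small splitting routine. This is only a sketch under assumed conventions: the function name `cold_start_split`, the `holdout_frac` parameter, and the random (rather than time-based) choice of held-out users and items are hypothetical and are not taken from the paper's protocol.

```python
import random

def cold_start_split(interactions, holdout_frac=0.2, seed=0):
    """Split (user, item, timestamp) triples into a warm training set and
    two cold-start test sets. Illustrative protocol, not the paper's."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in interactions})
    items = sorted({i for _, i, _ in interactions})
    # Randomly mark a fraction of users and items as "cold" (assumption).
    cold_users = set(rng.sample(users, round(len(users) * holdout_frac)))
    cold_items = set(rng.sample(items, round(len(items) * holdout_frac)))

    train, user_cold, item_cold = [], [], []
    for u, i, t in interactions:
        if u in cold_users and i in cold_items:
            continue                      # both unseen: dropped to keep the settings separate
        if u in cold_users:
            user_cold.append((u, i, t))   # user never appears in training
        elif i in cold_items:
            item_cold.append((u, i, t))   # item never appears in training
        else:
            train.append((u, i, t))       # warm user, warm item
    return train, user_cold, item_cold
```

Keeping the two test sets disjoint from each other is a design choice in this sketch: it lets a drop in one setting be attributed to unseen users or unseen items alone.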
Novelty
The primary contribution is a controlled, reproducible benchmarking framework that treats cold-start as the central evaluation setting for generative recommendation, rather than proposing a new model. The study systematically isolates how model scale, identifier design (atomic, textual, and semantic codes with different quantization schemes), and reinforcement-learning-based training individually affect cold-start generalization under unified protocols across three datasets.
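To illustrate how the three identifier families differ for an unseen item, the sketch below builds each kind of identifier. All function names, token formats, and codebook values are hypothetical. The semantic code uses plain product quantization with codebooks assumed to be already trained; OPQ would additionally learn a rotation of the embedding before quantizing, which is omitted here.

```python
def atomic_id(item_index):
    # One opaque token per item; an unseen item has no trained token.
    return [f"<item_{item_index}>"]

def textual_id(title):
    # Reuses ordinary words, so an unseen item's title is still
    # made of tokens the language model already knows.
    return title.lower().split()

def semantic_code(embedding, codebooks):
    """Product-quantization-style code: split the embedding into
    len(codebooks) equal chunks and pick the nearest centroid per chunk."""
    size = len(embedding) // len(codebooks)
    tokens = []
    for level, centroids in enumerate(codebooks):
        chunk = embedding[level * size:(level + 1) * size]
        # Index of the centroid with the smallest squared distance.
        best = min(
            range(len(centroids)),
            key=lambda c: sum((x - y) ** 2 for x, y in zip(chunk, centroids[c])),
        )
        tokens.append(f"<c{level}_{best}>")
    return tokens

# Toy codebooks: two levels, two 2-dimensional centroids each (made-up values).
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
new_item_code = semantic_code([0.9, 0.8, 0.1, 0.9], codebooks)
```

In this toy setup a new item still maps to code tokens shared with existing items, which is one plausible reading of why compositional codes transfer to unseen items better than atomic IDs.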
Results
Empirically, item cold-start performance drops dramatically across all models (e.g., Recall@10 falling from ~0.08 to near zero for many methods), while user cold-start degradation is comparatively moderate. Textual identifiers substantially improve unseen-item recommendation but degrade warm-start and user cold-start performance, whereas compositional semantic codes (e.g., OPQ) improve item cold-start robustness without sacrificing warm-start accuracy. Scaling model size from Flan-T5-small to Flan-T5-xl yields consistent but marginal gains that do not close the cold-start gap. Adding reinforcement learning slightly reduces performance under cold-start conditions, by up to 6.5% in Recall@10 for item cold-start on Amazon-Toys.
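For readers less familiar with the metrics quoted here, the following sketch shows one common definition of Recall@K and of a relative change such as the −6.5% figure. The function names are hypothetical, the paper's exact evaluation code may differ, and the numbers in the usage lines are made up for illustration rather than taken from the paper.

```python
def recall_at_k(ranked_items, relevant_items, k=10):
    """Fraction of relevant items that appear in the top-k ranking.
    With a single held-out next item this equals the hit rate."""
    if not relevant_items:
        return 0.0
    top_k = set(ranked_items[:k])
    hits = len(top_k & set(relevant_items))
    return hits / len(relevant_items)

def relative_change(new, baseline):
    # Signed change relative to the baseline, e.g. -0.065 means -6.5%.
    return (new - baseline) / baseline

# Illustrative values only (not results from the paper).
hit = recall_at_k(["a", "b", "c"], ["b"], k=2)
drop = relative_change(0.187, 0.200)
```

Reading the RL result as a relative rather than absolute change matters: when the baseline Recall@10 is already near zero, a 6.5% relative drop is a very small absolute difference.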
Key Points
- Item cold-start is substantially harder than user cold-start across all reproduced methods and datasets, with many models experiencing near-complete performance collapse for unseen items while maintaining moderate performance for unseen users.
- Identifier design is a decisive factor: textual identifiers markedly improve unseen-item recommendation but degrade warm-start and user cold-start performance, while compositional semantic codes (e.g., OPQ) improve item cold-start robustness without sacrificing warm-start accuracy.
- Scaling model size and adding reinforcement learning provide limited cold-start benefits; RL can even slightly degrade robustness (up to −6.5% Recall@10 for item cold-start on Amazon-Toys), indicating that these design choices alone do not address the distribution-shift challenge in generative recommendation.