2026-03-31 Daily Report: Cold-Starts in Generative Recommendation: A Reproducibility Study

Authors: Zhen Zhang, Jujia Zhao, Xinyu Ma, Xin Xin, Maarten de Rijke, Zhaochun Ren

Affiliations: Leiden University / University of Amsterdam / Shandong University / Baidu

Categories: Task / Cold-Start Recommendation / Open-world platform challenge, Evaluation / Reproducibility Study / Unified protocol benchmarking, Method / Generative Recommendation / Systematic analysis

License: CC BY 4.0

Abstract Overview

This paper presents a systematic reproducibility study of generative recommendation under unified cold-start protocols, covering both new-user and new-item settings. The authors reproduce representative generative recommenders alongside traditional sequential baselines (SASRec, GRU4Rec) and evaluate them on three datasets: Amazon-Toys, MicroLens, and Steam. The study isolates three design dimensions that are often entangled in prior work: model scale, item identifier design, and training strategy (SFT vs. reinforcement learning). Experiments reveal that cold-start behavior is highly asymmetric, with item cold-start causing dramatic performance drops while user cold-start leads to only moderate degradation. The work provides a controlled benchmarking framework and actionable guidance for improving generalization in generative recommendation systems under cold-start conditions.
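To make the two settings concrete, here is a minimal sketch of how interactions could be partitioned into a training pool, a user cold-start test set, and an item cold-start test set. The function name `split_cold_start`, the held-out sets, and the choice to drop interactions that are cold on both sides are illustrative assumptions; this summary does not describe the paper's actual split procedure, and a warm-start test set is omitted for brevity.

```python
def split_cold_start(interactions, cold_users, cold_items):
    """Partition (user, item) pairs into train / user-cold / item-cold sets.

    Hypothetical helper for illustration only; not the paper's protocol.
    """
    train, user_cold, item_cold = [], [], []
    for user, item in interactions:
        if user in cold_users and item in cold_items:
            # Assumption: skip doubly-cold pairs so the two settings stay separate.
            continue
        if user in cold_users:
            user_cold.append((user, item))   # unseen user, seen item
        elif item in cold_items:
            item_cold.append((user, item))   # seen user, unseen item
        else:
            train.append((user, item))       # both sides seen during training
    return train, user_cold, item_cold


# Toy data (made up for this sketch).
interactions = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "c"),
    ("u3", "b"), ("u3", "c"),
]
train, user_cold, item_cold = split_cold_start(
    interactions, cold_users={"u3"}, cold_items={"c"}
)
```

The point of the sketch is that no cold user or cold item ever appears in `train`, so any score on `user_cold` or `item_cold` measures generalization rather than memorization.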

Novelty

The primary contribution is a controlled, reproducible benchmarking framework that treats cold-start as the central evaluation setting for generative recommendation, rather than proposing a new model. The study systematically isolates how model scale, identifier design (atomic, textual, and semantic codes with different quantization schemes), and reinforcement-learning-based training individually affect cold-start generalization under unified protocols across three datasets.
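As a rough illustration of what a compositional semantic code is, the sketch below assigns each item a tuple of codebook indices by plain product quantization: the embedding is cut into sub-vectors, and each sub-vector is mapped to its nearest centroid. This is a simplification under stated assumptions. The codebooks are hand-written toy values (in practice they would be learned, for example with k-means), OPQ additionally learns a rotation of the embedding space that is not shown here, and `pq_encode` is a hypothetical name rather than anything from the paper's code.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(embedding, codebooks):
    """Return one centroid index per sub-vector (plain PQ, no learned rotation)."""
    if len(embedding) % len(codebooks) != 0:
        raise ValueError("embedding length must be divisible by the number of codebooks")
    sub_dim = len(embedding) // len(codebooks)
    code = []
    for m, centroids in enumerate(codebooks):
        sub = embedding[m * sub_dim:(m + 1) * sub_dim]
        nearest = min(
            range(len(centroids)),
            key=lambda k: squared_distance(sub, centroids[k]),
        )
        code.append(nearest)
    return tuple(code)


# Toy codebooks (assumed given): two sub-spaces, two centroids each.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0)],  # centroids for dimensions 0-1
    [(0.0, 1.0), (1.0, 0.0)],  # centroids for dimensions 2-3
]

seen_item = (0.1, 0.0, 0.9, 0.1)
unseen_item = (0.9, 1.1, 0.1, 0.8)

seen_code = pq_encode(seen_item, CODEBOOKS)
unseen_code = pq_encode(unseen_item, CODEBOOKS)
```

An item that never appeared in training still receives a code built entirely from existing codebook tokens, which is one plausible reason such identifiers could transfer to unseen items better than atomic IDs.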

Results

Empirically, item cold-start performance drops dramatically across all models (e.g., Recall@10 falling from ~0.08 to near zero for many methods), while user cold-start degradation is comparatively moderate. Textual identifiers substantially improve unseen-item recommendation but degrade warm-start and user cold-start performance, whereas compositional semantic codes (e.g., OPQ) improve item cold-start robustness without sacrificing warm-start accuracy. Scaling model size from Flan-T5-small to Flan-T5-xl yields consistent but marginal gains that do not close the cold-start gap, and adding reinforcement learning slightly reduces performance under cold-start conditions (e.g., up to −6.5% Recall@10 for item cold-start on Amazon-Toys).
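For readers unfamiliar with the metric, the following sketch shows one common way to compute Recall@K for next-item recommendation, plus a helper for expressing a change between two scores as a relative percentage. It assumes a single ground-truth item per test case, in which case Recall@K equals the hit rate. Whether the percentages quoted above are relative or absolute changes is not stated in this summary, so `relative_change` is an assumption about how such a figure could be derived, and all numbers in the example are made up.

```python
def recall_at_k(ranked_lists, targets, k=10):
    """Fraction of test cases whose target item appears in the top-k list.

    Assumes exactly one ground-truth item per test case.
    """
    if not targets:
        raise ValueError("targets must not be empty")
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


def relative_change(new, old):
    """Relative change from old to new, e.g. -0.065 means a 6.5% drop."""
    return (new - old) / old


# Toy rankings and targets (illustrative only, not results from the paper).
ranked_lists = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
    ["j", "k", "l"],
]
targets = ["b", "f", "x", "j"]

score = recall_at_k(ranked_lists, targets, k=2)
```

Under this definition, a model that can never generate an unseen item's identifier scores exactly zero on item cold-start, which matches the kind of collapse described above.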

Key Points

Item cold-start is substantially harder than user cold-start across all reproduced methods and datasets, with many models experiencing near-complete performance collapse for unseen items while maintaining moderate performance for unseen users.
Identifier design is a decisive factor: textual identifiers markedly improve unseen-item recommendation but degrade warm-start and user cold-start performance, while compositional semantic codes (e.g., OPQ) improve item cold-start robustness without sacrificing warm-start accuracy.
Scaling model size and adding reinforcement learning provide limited cold-start benefits. Scaling from Flan-T5-small to Flan-T5-xl yields only marginal gains, and RL can even slightly degrade robustness (up to −6.5% Recall@10 for item cold-start on Amazon-Toys), indicating that these design choices alone do not address the distribution-shift challenge in generative recommendation.

References

arXiv: https://arxiv.org/abs/2603.29845v1
Fugu-MT: https://fugumt.com/fugumt/paper_check/2603.29845v1
Project: https://anonymous.4open.science/r/ColdGenrec-0DEC