Related papers: Enhancing Bandit Algorithms with LLMs for Time-varying User Preferences in Streaming Recommendations

Enhancing Bandit Algorithms with LLMs for Time-varying User Preferences in Streaming Recommendations

URL: http://arxiv.org/abs/2602.08067v1
Date: Sun, 08 Feb 2026 17:48:43 GMT
Title: Enhancing Bandit Algorithms with LLMs for Time-varying User Preferences in Streaming Recommendations
Authors: Chenglei Shen, Yi Zhan, Weijie Yu, Xiao Zhang, Jun Xu,
Abstract summary: HyperBandit+ is a novel bandit policy that integrates a time-aware hypernetwork to adapt to time-varying user preferences.<n>We show that HyperBandit+ consistently outperforms state-of-the-art baselines in terms of accumulated rewards.
Score: 10.190789989569085
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In real-world streaming recommender systems, user preferences evolve dynamically over time. Existing bandit-based methods treat time merely as a timestamp, neglecting its explicit relationship with user preferences and leading to suboptimal performance. Moreover, online learning methods often suffer from inefficient exploration-exploitation during the early online phase. To address these issues, we propose HyperBandit+, a novel contextual bandit policy that integrates a time-aware hypernetwork to adapt to time-varying user preferences and employs a large language model-assisted warm-start mechanism (LLM Start) to enhance exploration-exploitation efficiency in the early online phase. Specifically, HyperBandit+ leverages a neural network that takes time features as input and generates parameters for estimating time-varying rewards by capturing the correlation between time and user preferences. Additionally, the LLM Start mechanism employs multi-step data augmentation to simulate realistic interaction data for effective offline learning, providing warm-start parameters for the bandit policy in the early online phase. To meet real-time streaming recommendation demands, we adopt low-rank factorization to reduce hypernetwork training complexity. Theoretically, we rigorously establish a sublinear regret upper bound that accounts for both the hypernetwork and the LLM warm-start mechanism. Extensive experiments on real-world datasets demonstrate that HyperBandit+ consistently outperforms state-of-the-art baselines in terms of accumulated rewards.

Related papers

STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning [54.28691219536054]
We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities.<n>We develop anchored reinforcement training - a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping.<n>Experiments on MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-26T08:47:58Z)
Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? [65.18157595903124]
This work investigates iterative approximate evaluation for arbitrary prompts.<n>It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework.<n>MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
arXiv Detail & Related papers (2025-07-07T03:20:52Z)
Offline Critic-Guided Diffusion Policy for Multi-User Delay-Constrained Scheduling [29.431945795881976]
We propose a novel offline reinforcement learning-based algorithm, named underlineScheduling.<n>It learns efficient scheduling policies purely from pre-collected emphoffline data.<n>We show that SOCD is resilient to various system dynamics, including partially observable and large-scale environments.
arXiv Detail & Related papers (2025-01-22T15:13:21Z)
PreMixer: MLP-Based Pre-training Enhanced MLP-Mixers for Large-scale Traffic Forecasting [30.055634767677823]
In urban computing, precise and swift forecasting of time series data from traffic networks is crucial.<n>Current research limitations because of inherent inefficiency of model and their unsuitability for large-scale traffic applications due to model complexity.<n>This paper proposes a novel framework, named PreMixer, designed to bridge this gap. It features a predictive model and a pre-training mechanism, both based on the principles of Multi-Layer Perceptrons (MLP)<n>Our framework achieves comparable state-of-theart performance while maintaining high computational efficiency, as verified by extensive experiments on large-scale traffic datasets.
arXiv Detail & Related papers (2024-12-18T08:35:40Z)
Reprogramming Foundational Large Language Models(LLMs) for Enterprise Adoption for Spatio-Temporal Forecasting Applications: Unveiling a New Era in Copilot-Guided Cross-Modal Time Series Representation Learning [0.0]
patio-temporal forecasting plays a crucial role in various sectors such as transportation systems, logistics, and supply chain management. We introduce a hybrid approach that combines the strengths of open-source large and small-scale language models (LLMs and LMs) with traditional forecasting methods.
arXiv Detail & Related papers (2024-08-26T16:11:53Z)
Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data. Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
HyperBandit: Contextual Bandit with Hypernewtork for Time-Varying User Preferences in Streaming Recommendation [11.908362247624131]
Existing streaming recommender models only consider time as a timestamp. We propose a contextual bandit approach using hypernetwork, called HyperBandit. We show that the proposed HyperBandit consistently outperforms the state-of-the-art baselines in terms of accumulated rewards.
arXiv Detail & Related papers (2023-08-14T14:04:57Z)
Dynamic Scheduling for Federated Edge Learning with Streaming Data [56.91063444859008]
We consider a Federated Edge Learning (FEEL) system where training data are randomly generated over time at a set of distributed edge devices with long-term energy constraints. Due to limited communication resources and latency requirements, only a subset of devices is scheduled for participating in the local training process in every iteration.
arXiv Detail & Related papers (2023-05-02T07:41:16Z)
Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward. We introduce a new RL formulation for text generation from the soft Q-learning perspective. We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
AWAC: Accelerating Online Reinforcement Learning with Offline Datasets [84.94748183816547]
We show that our method, advantage weighted actor critic (AWAC), enables rapid learning of skills with a combination of prior demonstration data and online experience. Our results show that incorporating prior data can reduce the time required to learn a range of robotic skills to practical time-scales.
arXiv Detail & Related papers (2020-06-16T17:54:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.