Efficient Multimodal Streaming Recommendation via Expandable Side Mixture-of-Experts
- URL: http://arxiv.org/abs/2508.05993v1
- Date: Fri, 08 Aug 2025 04:00:05 GMT
- Title: Efficient Multimodal Streaming Recommendation via Expandable Side Mixture-of-Experts
- Authors: Yunke Qu, Liang Qu, Tong Chen, Quoc Viet Hung Nguyen, Hongzhi Yin
- Abstract summary: Multimodal streaming recommender systems are widely deployed in real-world applications, where user interests shift over time. We propose Expandable Side Mixture-of-Experts (XSMoE), a memory-efficient framework for multimodal streaming recommendation. XSMoE attaches lightweight side-tuning modules to frozen pretrained encoders and incrementally expands them in response to evolving user feedback.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Streaming recommender systems (SRSs) are widely deployed in real-world applications, where user interests shift and new items arrive over time. As a result, effectively capturing users' latest preferences is challenging, as interactions reflecting recent interests are limited and new items often lack sufficient feedback. A common solution is to enrich item representations using multimodal encoders (e.g., BERT or ViT) to extract visual and textual features. However, these encoders are pretrained on general-purpose tasks: they are not tailored to user preference modeling, and they overlook the fact that user tastes toward modality-specific features such as visual styles and textual tones can also drift over time. This presents two key challenges in streaming scenarios: the high cost of fine-tuning large multimodal encoders, and the risk of forgetting long-term user preferences due to continuous model updates. To tackle these challenges, we propose Expandable Side Mixture-of-Experts (XSMoE), a memory-efficient framework for multimodal streaming recommendation. XSMoE attaches lightweight side-tuning modules consisting of expandable expert networks to frozen pretrained encoders and incrementally expands them in response to evolving user feedback. A gating router dynamically combines expert and backbone outputs, while a utilization-based pruning strategy maintains model compactness. By learning new patterns through expandable experts without overwriting previously acquired knowledge, XSMoE effectively captures both cold-start and shifting preferences in multimodal features. Experiments on three real-world datasets demonstrate that XSMoE outperforms state-of-the-art baselines in both recommendation quality and computational efficiency.
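The abstract describes three mechanisms: trainable side experts attached to a frozen encoder, a gating router that mixes expert outputs with the backbone output, and utilization-based pruning. The sketch below shows one way such a layer could look in PyTorch; the class and method names (`SideMoE`, `expand`, `prune_underused`), the bottleneck expert architecture, the EMA utilization tracking, and the pruning threshold are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SideMoE(nn.Module):
    """Illustrative side mixture-of-experts layer.

    Expert architecture, EMA utilization tracking, and the pruning
    threshold are assumptions for illustration, not the paper's design.
    """

    def __init__(self, dim: int, num_experts: int = 2, hidden: int = 64):
        super().__init__()
        self.dim, self.hidden = dim, hidden
        self.experts = nn.ModuleList([self._make_expert() for _ in range(num_experts)])
        # Router mixes [frozen backbone output] + [each expert's output].
        self.router = nn.Linear(dim, num_experts + 1)
        # Exponential moving average of each expert's gate weight.
        self.register_buffer("usage", torch.zeros(num_experts))

    def _make_expert(self) -> nn.Module:
        # Lightweight bottleneck MLP (an assumed expert architecture).
        return nn.Sequential(nn.Linear(self.dim, self.hidden), nn.GELU(),
                             nn.Linear(self.hidden, self.dim))

    def forward(self, backbone_out: torch.Tensor) -> torch.Tensor:
        # backbone_out: (batch, dim) from a frozen pretrained encoder
        # (e.g., BERT or ViT features); only the side module is trained.
        gates = torch.softmax(self.router(backbone_out), dim=-1)  # (B, E+1)
        outs = torch.stack(
            [backbone_out] + [e(backbone_out) for e in self.experts], dim=1)
        if self.training:
            # Track how much each expert is actually used.
            self.usage.mul_(0.99).add_(0.01 * gates[:, 1:].mean(0).detach())
        return (gates.unsqueeze(-1) * outs).sum(dim=1)  # (B, dim)

    def expand(self) -> None:
        """Add a fresh expert when a new data block arrives in the stream."""
        self.experts.append(self._make_expert())
        self._resize_router(keep_rows=list(range(self.router.out_features)))
        self.usage = torch.cat([self.usage, self.usage.new_zeros(1)])

    def prune_underused(self, threshold: float = 0.01) -> None:
        """Drop experts whose average gate weight fell below threshold."""
        keep = [i for i, u in enumerate(self.usage.tolist()) if u >= threshold]
        self.experts = nn.ModuleList([self.experts[i] for i in keep])
        self._resize_router(keep_rows=[0] + [i + 1 for i in keep])
        self.usage = self.usage[keep]

    def _resize_router(self, keep_rows: list) -> None:
        # Rebuild the router, copying over the gate rows that survive.
        old = self.router
        new = nn.Linear(self.dim, len(self.experts) + 1).to(old.weight.device)
        with torch.no_grad():
            n = min(len(keep_rows), new.out_features)
            new.weight[:n] = old.weight[keep_rows[:n]]
            new.bias[:n] = old.bias[keep_rows[:n]]
        self.router = new
```

A streaming training loop would then freeze the backbone encoder, call `expand()` when a new data block arrives, train only the side module and router on that block, and periodically call `prune_underused()` to keep the model compact.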
Related papers
- GEMs: Breaking the Long-Sequence Barrier in Generative Recommendation with a Multi-Stream Decoder [54.64137490632567]
We propose a novel and unified framework designed to capture user sequences from long-term history. Generative Multi-streamers (GEMs) break user sequences into three streams. Extensive experiments on large-scale industrial datasets demonstrate that GEMs significantly outperform state-of-the-art methods in recommendation accuracy.
arXiv Detail & Related papers (2026-02-14T06:42:56Z)
- Bridging Collaborative Filtering and Large Language Models with Dynamic Alignment, Multimodal Fusion and Evidence-grounded Explanations [1.3702600718499687]
We develop an online adaptation mechanism that incorporates new user interactions through lightweight modules. We create a unified representation that seamlessly combines collaborative signals with visual and audio features. Our approach maintains the efficiency of frozen base models while adding minimal computational overhead, making it practical for real-world deployment.
arXiv Detail & Related papers (2025-10-02T02:43:24Z)
- Multi-modal Adaptive Mixture of Experts for Cold-start Recommendation [1.9967512860886603]
MAMEX is a novel framework for multimodal cold-start recommendation. It dynamically leverages latent representations from different modalities. Experiments show MAMEX outperforms state-of-the-art methods in cold-start scenarios.
arXiv Detail & Related papers (2025-08-11T14:47:14Z)
- M^2VAE: Multi-Modal Multi-View Variational Autoencoder for Cold-start Item Recommendation [14.644213412218742]
Cold-start item recommendation is a significant challenge in recommendation systems. Existing methods leverage multi-modal content to alleviate the cold-start issue. We propose a generative model that addresses the challenges of modeling common and unique views in attribute and multi-modal features.
arXiv Detail & Related papers (2025-08-01T09:16:26Z)
- Multi-agents based User Values Mining for Recommendation [52.26100802380767]
We propose a zero-shot multi-LLM collaborative framework for effective and accurate user value extraction. We apply text summarization techniques to condense item content while preserving essential meaning. To mitigate hallucinations, we introduce two specialized agent roles: evaluators and supervisors.
arXiv Detail & Related papers (2025-05-02T04:01:31Z)
- HistLLM: A Unified Framework for LLM-Based Multimodal Recommendation with User History Encoding and Compression [33.34435467588446]
HistLLM is an innovative framework that integrates textual and visual features through a User History Encoding Module (UHEM), compressing user history interactions into a single token representation. Extensive experiments demonstrate the effectiveness and efficiency of our proposed mechanism (a toy sketch of the single-token idea appears after this entry).
arXiv Detail & Related papers (2025-04-14T12:01:11Z)
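The single-token compression above suggests a simple attention-pooling picture: one learnable query attends over the user's history embeddings. A minimal sketch under that assumption (the name `HistoryCompressor` and the single-query design are hypothetical, not HistLLM's actual UHEM):

```python
import torch
import torch.nn as nn


class HistoryCompressor(nn.Module):
    """Compress a variable-length interaction history into one token.

    Illustrative only: one plausible reading of "compressing user history
    interactions into a single token representation", not HistLLM's UHEM.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # dim must be divisible by num_heads.
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, seq_len, dim) multimodal item embeddings.
        q = self.query.expand(history.size(0), -1, -1)
        token, _ = self.attn(q, history, history)
        return token.squeeze(1)  # (batch, dim): one token per user
```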
- Enhancing User Intent for Recommendation Systems via Large Language Models [0.0]
DUIP is a novel framework that combines LSTM networks with Large Language Models (LLMs) to dynamically capture user intent and generate personalized item recommendations. Our findings suggest that DUIP is a promising approach for next-generation recommendation systems, with potential for further improvements in cross-modal recommendations and scalability.
arXiv Detail & Related papers (2025-01-18T20:35:03Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout. DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- LLM-ESR: Large Language Models Enhancement for Long-tailed Sequential Recommendation [58.04939553630209]
In real-world systems, most users interact with only a handful of items, while the majority of items are seldom consumed.
These two issues, known as the long-tail user and long-tail item challenges, often pose difficulties for existing Sequential Recommendation systems.
We propose the Large Language Models Enhancement framework for Sequential Recommendation (LLM-ESR) to address these challenges.
arXiv Detail & Related papers (2024-05-31T07:24:42Z)
- MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation [61.45986275328629]
We propose MISSRec, a multi-modal pre-training and transfer learning framework for sequential recommendation.
On the user side, we design a Transformer-based encoder-decoder model, where the contextual encoder learns to capture the sequence-level multi-modal user interests.
On the candidate item side, we adopt a dynamic fusion module to produce user-adaptive item representations (a toy sketch of this fusion idea follows the entry).
arXiv Detail & Related papers (2023-08-22T04:06:56Z)
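MISSRec's user-adaptive item representation suggests a fusion gate conditioned on the user's interest vector. A minimal sketch under that assumption (the name `DynamicFusion` and the softmax-weighted design are hypothetical, not the authors' module):

```python
import torch
import torch.nn as nn


class DynamicFusion(nn.Module):
    """Fuse an item's modality features with per-user weights.

    Hypothetical reading of MISSRec's "dynamic fusion module"; not the
    authors' implementation.
    """

    def __init__(self, dim: int, num_modalities: int = 2):
        super().__init__()
        self.scorer = nn.Linear(dim, num_modalities)

    def forward(self, user_interest: torch.Tensor,
                item_modalities: torch.Tensor) -> torch.Tensor:
        # user_interest: (batch, dim) sequence-level interest vector;
        # item_modalities: (batch, M, dim), e.g., text and image embeddings.
        weights = torch.softmax(self.scorer(user_interest), dim=-1)  # (B, M)
        return (weights.unsqueeze(-1) * item_modalities).sum(dim=1)  # (B, dim)
```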