CAMMSR: Category-Guided Attentive Mixture of Experts for Multimodal Sequential Recommendation
- URL: http://arxiv.org/abs/2603.04320v1
- Date: Wed, 04 Mar 2026 17:39:35 GMT
- Title: CAMMSR: Category-Guided Attentive Mixture of Experts for Multimodal Sequential Recommendation
- Authors: Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Hewei Wang, Yijie Li, Jianheng Tang, Yunhuai Liu, Edith C. H. Ngai
- Abstract summary: We propose a Category-guided Attentive Mixture of Experts model for Multimodal Sequential Recommendation. At its core, CAMMSR introduces a category-guided attentive mixture of experts module, which learns specialized item representations from multiple perspectives. Experiments on four public datasets demonstrate that CAMMSR consistently outperforms state-of-the-art baselines.
- Score: 23.478610632707728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The explosion of multimedia data in information-rich environments has intensified the challenges of personalized content discovery, positioning recommendation systems as an essential form of passive data management. Multimodal sequential recommendation, which leverages diverse item information such as text and images, has shown great promise in enriching item representations and deepening the understanding of user interests. However, most existing models rely on heuristic fusion strategies that fail to capture the dynamic and context-sensitive nature of user-modal interactions. In real-world scenarios, user preferences for modalities vary not only across individuals but also within the same user across different items or categories. Moreover, the synergistic effects between modalities, where combined signals trigger user interest in ways isolated modalities cannot, remain largely underexplored. To this end, we propose CAMMSR, a Category-guided Attentive Mixture of Experts model for Multimodal Sequential Recommendation. At its core, CAMMSR introduces a category-guided attentive mixture of experts (CAMoE) module, which learns specialized item representations from multiple perspectives and explicitly models inter-modal synergies. This component dynamically allocates modality weights guided by an auxiliary category prediction task, enabling adaptive fusion of multimodal signals. Additionally, we design a modality swap contrastive learning task to enhance cross-modal representation alignment through sequence-level augmentation. Extensive experiments on four public datasets demonstrate that CAMMSR consistently outperforms state-of-the-art baselines, validating its effectiveness in achieving adaptive, synergistic, and user-centric multimodal sequential recommendation.
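Since the abstract describes CAMoE only at a high level, a minimal PyTorch sketch may help picture the fusion step: per-modality experts, a gate that allocates modality weights, and an auxiliary category head supervising the fused representation. Everything here (class and parameter names, the gating wiring) is an assumption for illustration, not the authors' implementation; the modality-swap contrastive task is omitted.

```python
# Illustrative sketch only; assumes PyTorch. Names are hypothetical,
# not CAMMSR's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoryGuidedMoEFusion(nn.Module):
    """Per-modality experts plus a gate that allocates modality weights,
    with an auxiliary category head supervising the fused output."""

    def __init__(self, dim, num_modalities, num_categories):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_modalities)
        )
        # Gate sees all modality features and emits per-item mixing weights.
        self.gate = nn.Linear(dim * num_modalities, num_modalities)
        # Auxiliary category classifier on the fused representation.
        self.category_head = nn.Linear(dim, num_categories)

    def forward(self, feats, category=None):
        # feats: list of (batch, dim) tensors, one per modality (e.g., text, image).
        expert_out = torch.stack(
            [exp(f) for exp, f in zip(self.experts, feats)], dim=1
        )                                                        # (batch, M, dim)
        weights = F.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        fused = (weights.unsqueeze(-1) * expert_out).sum(dim=1)  # (batch, dim)
        aux_loss = None
        if category is not None:
            # The category prediction loss nudges the gate toward
            # category-sensitive modality weighting.
            aux_loss = F.cross_entropy(self.category_head(fused), category)
        return fused, weights, aux_loss

fusion = CategoryGuidedMoEFusion(dim=64, num_modalities=3, num_categories=10)
feats = [torch.randn(8, 64) for _ in range(3)]
fused, w, loss = fusion(feats, category=torch.randint(0, 10, (8,)))
```

The abstract describes CAMoE as attentive and sequence-level; this sketch keeps only the gating-plus-auxiliary-task skeleton.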
Related papers
- PRISM: Personalized Recommendation via Information Synergy Module [12.797662213207936]
PRISM is a plug-and-play framework for sequential recommendation (SR).
It decomposes multimodal information into unique, redundant, and synergistic components.
Experiments on four datasets and three SR backbones demonstrate its effectiveness and versatility.
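To make the three-way split concrete, here is one plausible way to parameterize unique, redundant, and synergistic components for two modalities; this is a toy construction under our own assumptions, not PRISM's actual module.

```python
# Toy unique/redundant/synergistic split for two modalities;
# an assumption-laden illustration, not PRISM's design.
import torch
import torch.nn as nn

class ThreeWayDecomposition(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.unique_text = nn.Linear(dim, dim)    # text-only component
        self.unique_image = nn.Linear(dim, dim)   # image-only component
        self.shared = nn.Linear(dim, dim)         # redundant (shared) component
        self.joint = nn.Linear(2 * dim, dim)      # synergistic component

    def forward(self, text, image):
        u_t, u_v = self.unique_text(text), self.unique_image(image)
        # Redundant part: information both modalities carry; averaged here.
        r = 0.5 * (self.shared(text) + self.shared(image))
        # Synergistic part: only extractable from the joint input.
        s = self.joint(torch.cat([text, image], dim=-1))
        return u_t, u_v, r, s
```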
arXiv Detail & Related papers (2026-01-16T02:17:54Z)
- Structurally Refined Graph Transformer for Multimodal Recommendation [13.296555757708298]
We present SRGFormer, a structurally optimized multimodal recommendation model.
By modifying the transformer for better integration into our model, we capture the overall behavior patterns of users.
Then, we enhance structural information by embedding multimodal information into a hypergraph structure to aid in learning the local structures between users and items.
arXiv Detail & Related papers (2025-11-01T15:18:00Z)
- Revisiting scalable sequential recommendation with Multi-Embedding Approach and Mixture-of-Experts [15.976682531132676]
We propose Fuxi-MME, a framework that integrates a multi-embedding strategy with a Mixture-of-Experts (MoE) architecture.
Specifically, to efficiently capture diverse item characteristics in a decoupled manner, we decompose the conventional single embedding matrix into several lower-dimensional embedding matrices.
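A minimal sketch of that decomposition, assuming PyTorch: one d-dimensional table is replaced by several lower-dimensional tables whose lookups are refined and mixed by experts. Table count, expert shapes, and the gating scheme are our assumptions, not Fuxi-MME's code.

```python
# Illustrative multi-embedding + MoE item encoder, not Fuxi-MME itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiEmbeddingMoE(nn.Module):
    """Several low-dimensional embedding tables per item, each refined by
    its own expert and mixed by a learned gate."""

    def __init__(self, num_items, dim, num_tables=4):
        super().__init__()
        sub_dim = dim // num_tables
        self.tables = nn.ModuleList(
            nn.Embedding(num_items, sub_dim) for _ in range(num_tables)
        )
        self.experts = nn.ModuleList(
            nn.Linear(sub_dim, dim) for _ in range(num_tables)
        )
        self.gate = nn.Linear(num_tables * sub_dim, num_tables)

    def forward(self, item_ids):
        subs = [t(item_ids) for t in self.tables]                # k x (batch, sub_dim)
        gate_w = F.softmax(self.gate(torch.cat(subs, -1)), -1)   # (batch, k)
        outs = torch.stack([e(s) for e, s in zip(self.experts, subs)], 1)
        return (gate_w.unsqueeze(-1) * outs).sum(1)              # (batch, dim)
```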
arXiv Detail & Related papers (2025-10-29T08:42:15Z)
- Semantic Gaussian Mixture Variational Autoencoder for Sequential Recommendation [49.492451800322144]
We propose a novel VAE-based Sequential Recommendation model named SIGMA.
For multi-interest elicitation, SIGMA includes a probabilistic multi-interest extraction module.
Experiments on public datasets demonstrate the effectiveness of SIGMA.
arXiv Detail & Related papers (2025-02-22T08:29:52Z)
- LLM-based Bi-level Multi-interest Learning Framework for Sequential Recommendation [54.396000434574454]
We propose a novel multi-interest SR framework combining implicit behavioral and explicit semantic perspectives.
It includes two modules: the Implicit Behavioral Interest Module and the Explicit Semantic Interest Module.
Experiments on four real-world datasets validate the framework's effectiveness and practicality.
arXiv Detail & Related papers (2024-11-14T13:00:23Z)
- BiVRec: Bidirectional View-based Multimodal Sequential Recommendation [55.87443627659778]
We propose an innovative framework, BivRec, that jointly trains the recommendation tasks in both ID and multimodal views.
BivRec achieves state-of-the-art performance on five datasets and showcases various practical advantages.
arXiv Detail & Related papers (2024-02-27T09:10:41Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
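One way to picture an implicit query aggregating global contextual cues within a modality is a learned query attending over that modality's token features; this is our reading of the idea, not the paper's exact IMQ design.

```python
# Learned-query attention pooling over one modality's tokens;
# an interpretation of the IMQ idea, not the paper's implementation.
import torch
import torch.nn as nn

class ImplicitQueryPooling(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len, dim) features from a single modality.
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # (batch, 1, dim)
        return pooled.squeeze(1)                  # global context vector
```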
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation [61.45986275328629]
We propose MISSRec, a multi-modal pre-training and transfer learning framework for sequential recommendation.
On the user side, we design a Transformer-based encoder-decoder model, where the contextual encoder learns to capture the sequence-level multi-modal user interests.
On the candidate item side, we adopt a dynamic fusion module to produce user-adaptive item representation.
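A rough sketch of user-adaptive item representation via dynamic fusion: modality weights conditioned on the user state, so the same item is fused differently per user. The conditioning scheme here is our guess, not MISSRec's actual module.

```python
# User-conditioned mixing of an item's modality features;
# an illustrative guess at "dynamic fusion", not MISSRec's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UserAdaptiveFusion(nn.Module):
    def __init__(self, dim, num_modalities):
        super().__init__()
        self.gate = nn.Linear(dim, num_modalities)

    def forward(self, user_state, item_feats):
        # user_state: (batch, dim); item_feats: (batch, num_modalities, dim)
        w = F.softmax(self.gate(user_state), dim=-1)      # (batch, M)
        return (w.unsqueeze(-1) * item_feats).sum(dim=1)  # (batch, dim)
```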
arXiv Detail & Related papers (2023-08-22T04:06:56Z)
- Knowledge-Enhanced Hierarchical Graph Transformer Network for Multi-Behavior Recommendation [56.12499090935242]
This work proposes a Knowledge-Enhanced Hierarchical Graph Transformer Network (KHGT) to investigate multi-typed interactive patterns between users and items in recommender systems.
KHGT is built upon a graph-structured neural architecture to capture type-specific behavior characteristics.
We show that KHGT consistently outperforms many state-of-the-art recommendation methods across various evaluation settings.
arXiv Detail & Related papers (2021-10-08T09:44:00Z)
- Graph Meta Network for Multi-Behavior Recommendation [24.251784947151755]
We propose a Multi-Behavior recommendation framework with a Graph Meta Network (MB-GMN) that incorporates multi-behavior pattern modeling into a meta-learning paradigm.
Our developed MB-GMN empowers the user-item interaction learning with the capability of uncovering type-dependent behavior representations.
arXiv Detail & Related papers (2021-10-08T08:38:27Z)
- Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generality of the proposed mhsf model under both pre-training+fine-tuning and training-from-scratch strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z)