Related papers: M^2VAE: Multi-Modal Multi-View Variational Autoencoder for Cold-start Item Recommendation

M^2VAE: Multi-Modal Multi-View Variational Autoencoder for Cold-start Item Recommendation

URL: http://arxiv.org/abs/2508.00452v1
Date: Fri, 01 Aug 2025 09:16:26 GMT
Title: M^2VAE: Multi-Modal Multi-View Variational Autoencoder for Cold-start Item Recommendation
Authors: Chuan He, Yongchao Liu, Qiang Li, Wenliang Zhong, Chuntao Hong, Xinwei Yao,
Abstract summary: Cold-start item recommendation is a significant challenge in recommendation systems.<n>Existing methods leverage multi-modal content to alleviate the cold-start issue.<n>We propose a generative model that addresses the challenges of modeling common and unique views in attribute and multi-modal features.
Score: 14.644213412218742
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Cold-start item recommendation is a significant challenge in recommendation systems, particularly when new items are introduced without any historical interaction data. While existing methods leverage multi-modal content to alleviate the cold-start issue, they often neglect the inherent multi-view structure of modalities, the distinction between shared and modality-specific features. In this paper, we propose Multi-Modal Multi-View Variational AutoEncoder (M^2VAE), a generative model that addresses the challenges of modeling common and unique views in attribute and multi-modal features, as well as user preferences over single-typed item features. Specifically, we generate type-specific latent variables for item IDs, categorical attributes, and image features, and use Product-of-Experts (PoE) to derive a common representation. A disentangled contrastive loss decouples the common view from unique views while preserving feature informativeness. To model user inclinations, we employ a preference-guided Mixture-of-Experts (MoE) to adaptively fuse representations. We further incorporate co-occurrence signals via contrastive learning, eliminating the need for pretraining. Extensive experiments on real-world datasets validate the effectiveness of our approach.

Related papers

CAMMSR: Category-Guided Attentive Mixture of Experts for Multimodal Sequential Recommendation [23.478610632707728]
We propose a Category-guided Attentive Mixture of Experts model for Multimodal Sequential Recommendation.<n>At its core, CAMMSR introduces a category-guided attentive mixture of experts module, which learns specialized item representations from multiple perspectives.<n>Experiments on four public datasets demonstrate that CAMMSR consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-04T17:39:35Z)
MMQ: Multimodal Mixture-of-Quantization Tokenization for Semantic ID Generation and User Behavioral Adaptation [16.81485354427923]
We propose Multimodal Mixture-of-Quantization (MMQ), a two-stage framework that trains a novel multimodal tokenizer.<n> MMQ unifies multimodal synergy, specificity, and behavioral adaptation, providing a scalable and versatile solution for both generative retrieval and discriminative ranking tasks.
arXiv Detail & Related papers (2025-08-21T06:15:49Z)
MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching [54.740256498985026]
Keypoint detection and description methods often struggle with multimodal data.<n>We propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching.
arXiv Detail & Related papers (2025-01-20T06:56:30Z)
MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt [60.10555128510744]
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities.<n>Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks.<n>We introduce a novel framework called MambaPro for multi-modal object ReID.
arXiv Detail & Related papers (2024-12-14T06:33:53Z)
DeMo: Decoupled Feature-Based Mixture of Experts for Multi-Modal Object Re-Identification [25.781336502845395]
Multi-modal object ReIDentification aims to retrieve specific objects by combining complementary information from multiple modalities.<n>We propose a novel feature learning framework called DeMo for multi-modal object ReID, which adaptively balances decoupled features using a mixture of experts.
arXiv Detail & Related papers (2024-12-14T02:36:56Z)
Multimodal Difference Learning for Sequential Recommendation [5.243083216855681]
We argue that user interests and item relationships vary across different modalities.<n>We propose a novel Multimodal Learning framework for Sequential Recommendation, MDSRec.<n>Results on five real-world datasets demonstrate the superiority of MDSRec over state-of-the-art baselines.
arXiv Detail & Related papers (2024-12-11T05:08:19Z)
Facet-Aware Multi-Head Mixture-of-Experts Model for Sequential Recommendation [25.516648802281626]
We propose a novel structure called Facet-Aware Multi-Head Mixture-of-Experts Model for Sequential Recommendation (FAME)<n>We leverage sub-embeddings from each head in the last multi-head attention layer to predict the next item separately.<n>A gating mechanism integrates recommendations from each head and dynamically determines their importance.
arXiv Detail & Related papers (2024-11-03T06:47:45Z)
Learning Multi-Aspect Item Palette: A Semantic Tokenization Framework for Generative Recommendation [55.99632509895994]
We introduce LAMIA, a novel approach for multi-aspect semantic tokenization.<n>Unlike RQ-VAE, which uses a single embedding, LAMIA learns an item palette''--a collection of independent and semantically parallel embeddings.<n>Our results demonstrate significant improvements in recommendation accuracy over existing methods.
arXiv Detail & Related papers (2024-09-11T13:49:48Z)
A Unified Graph Transformer for Overcoming Isolations in Multi-modal Recommendation [9.720586396359906]
We argue that existing multi-modal recommender systems typically use isolated processes for both feature extraction and modality modelling. We propose a novel model, called Unified Multi-modal Graph Transformer (UGT), which leverages a multi-way transformer to extract aligned multi-modal features. We show that the UGT model can achieve significant effectiveness gains, especially when jointly optimised with the commonly-used multi-modal recommendation losses.
arXiv Detail & Related papers (2024-07-29T11:04:31Z)
MMGRec: Multimodal Generative Recommendation with Transformer Model [81.61896141495144]
MMGRec aims to introduce a generative paradigm into multimodal recommendation. We first devise a hierarchical quantization method Graph CF-RQVAE to assign Rec-ID for each item from its multimodal information. We then train a Transformer-based recommender to generate the Rec-IDs of user-preferred items based on historical interaction sequences.
arXiv Detail & Related papers (2024-04-25T12:11:27Z)
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named textbfEDITOR to select diverse tokens from vision Transformers for multi-modal object ReID. Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z)
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
MM-GEF: Multi-modal representation meet collaborative filtering [43.88159639990081]
We propose a graph-based item structure enhancement method MM-GEF: Multi-Modal recommendation with Graph Early-Fusion. MM-GEF learns refined item representations by injecting structural information obtained from both multi-modal and collaborative signals.
arXiv Detail & Related papers (2023-08-14T15:47:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.