MLLMRec: Exploring the Potential of Multimodal Large Language Models in Recommender Systems
- URL: http://arxiv.org/abs/2508.15304v1
- Date: Thu, 21 Aug 2025 06:50:00 GMT
- Title: MLLMRec: Exploring the Potential of Multimodal Large Language Models in Recommender Systems
- Authors: Yuzhuo Dang, Xin Zhang, Zhiqiang Pan, Yuxiao Duan, Wanyu Chen, Fei Cai, Honghui Chen
- Abstract summary: We propose a novel MLLM-driven multimodal recommendation framework with two item-item graph refinement strategies. MLLMRec achieves state-of-the-art performance with an average improvement of 38.53% over the best baselines.
- Score: 8.744074431975019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal recommendation typically combines user behavioral data with the modal features of items to reveal user preferences, achieving superior performance over conventional recommendation methods. However, existing methods still suffer from two key problems: (1) the initialization methods for user multimodal representations are either behavior-unperceived or noise-contaminated, and (2) the KNN-based item-item graph contains noisy edges with low similarities and lacks audience co-occurrence relationships. To address these issues, we propose MLLMRec, a novel MLLM-driven multimodal recommendation framework with two item-item graph refinement strategies. On the one hand, item images are first converted into high-quality semantic descriptions using an MLLM, which are then fused with the items' textual metadata. We then construct a behavioral description list for each user and feed it into the MLLM to reason about a purified user preference that captures the underlying interaction motivations. On the other hand, we design threshold-controlled denoising and topology-aware enhancement strategies to refine the suboptimal item-item graph, thereby improving item representation learning. Extensive experiments on three publicly available datasets demonstrate that MLLMRec achieves state-of-the-art performance, with an average improvement of 38.53% over the best baselines.
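The abstract does not include code; as a rough illustration of the threshold-controlled denoising idea (pruning low-similarity edges from a KNN item-item graph before propagation), here is a minimal NumPy sketch. The cosine-similarity features, the threshold `tau`, and the symmetric normalization are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def build_denoised_item_graph(feats: np.ndarray, k: int = 10, tau: float = 0.5) -> np.ndarray:
    """Build a KNN item-item graph and prune edges below a similarity threshold.

    feats: (n_items, d) multimodal item features.
    k:     neighbors kept per item before thresholding.
    tau:   similarity threshold; weaker edges are treated as noise and dropped.
    """
    # Cosine similarity between all item pairs.
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-loops

    n = sim.shape[0]
    adj = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sim[i])[-k:]      # top-k most similar items
        keep = nbrs[sim[i, nbrs] >= tau]    # threshold-controlled denoising
        adj[i, keep] = sim[i, keep]

    adj = np.maximum(adj, adj.T)  # symmetrize the graph
    # Symmetric normalization D^{-1/2} A D^{-1/2}, as commonly used before propagation.
    deg = adj.sum(axis=1)
    d_inv = np.zeros_like(deg)
    d_inv[deg > 0] = deg[deg > 0] ** -0.5
    return d_inv[:, None] * adj * d_inv[None, :]
```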
Related papers
- DMESR: Dual-view MLLM-based Enhancing Framework for Multimodal Sequential Recommendation [13.114773060703891]
We propose a Dual-view MLLM-based Enhancing framework for multimodal Sequential Recommendation (DMESR). For the misalignment issue, we employ a contrastive learning mechanism to align the cross-modal semantic representations generated by MLLMs. For the loss of fine-grained semantics, we introduce a cross-attention fusion module that integrates the coarse-grained semantic knowledge obtained from MLLMs with the fine-grained original textual semantics.
arXiv Detail & Related papers (2026-02-14T10:42:56Z)
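The summary above mentions a contrastive mechanism for aligning cross-modal representations generated by MLLMs. A generic, symmetric InfoNCE-style sketch of such alignment might look like the following; the temperature value and the per-item pairing of text and image embeddings are assumptions, not DMESR's actual design.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(text_emb: torch.Tensor, image_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling matched text/image pairs together.

    text_emb, image_emb: (batch, d) embeddings of the same batch of items,
    row i of each tensor describing the same item.
    """
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = t @ v.T / temperature                       # (batch, batch) similarities
    targets = torch.arange(t.size(0), device=t.device)   # diagonal = matched pairs
    # Average the text->image and image->text cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```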
- Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs [28.752042722391934]
Sequential recommendation (SR) aims to capture users' dynamic interests and sequential patterns based on their historical interactions. MME-SID integrates multimodal embeddings and quantized embeddings to mitigate embedding collapse. Extensive experiments on three public datasets validate the superior performance of MME-SID.
arXiv Detail & Related papers (2025-09-02T07:02:29Z)
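MME-SID's summary describes integrating multimodal embeddings with quantized (semantic-ID) embeddings so that item representations do not collapse onto one another. A toy PyTorch sketch of one plausible fusion, combining a projected multimodal feature with pooled semantic-ID code embeddings, is below; the module structure and the additive fusion are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ItemEncoder(nn.Module):
    """Toy encoder fusing a multimodal feature with semantic-ID code embeddings.

    Keeping a content-derived component alongside learned code embeddings is one
    way to keep item vectors from collapsing onto one another.
    """
    def __init__(self, mm_dim: int, codebook_size: int, n_levels: int, d: int):
        super().__init__()
        self.codebook_size = codebook_size
        self.mm_proj = nn.Linear(mm_dim, d)                       # project multimodal features
        self.sid_emb = nn.Embedding(codebook_size * n_levels, d)  # one table per code level

    def forward(self, mm_feat: torch.Tensor, sid_tokens: torch.LongTensor) -> torch.Tensor:
        # mm_feat: (batch, mm_dim); sid_tokens: (batch, n_levels), values in [0, codebook_size)
        offsets = torch.arange(sid_tokens.size(1)) * self.codebook_size
        sid = self.sid_emb(sid_tokens + offsets).mean(dim=1)      # pool level embeddings
        return self.mm_proj(mm_feat) + sid
```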
- MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning [57.42710816140401]
A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM's ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLMs' code-based capabilities in multi-modal mathematical reasoning.
arXiv Detail & Related papers (2025-07-24T07:03:11Z)
- Learning Item Representations Directly from Multimodal Features for Effective Recommendation [51.49251689107541]
Multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations. We propose a novel model (i.e., LIRDRec) that learns item representations directly from multimodal features to augment recommendation performance.
arXiv Detail & Related papers (2025-05-08T05:42:22Z)
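The summary notes that multimodal recommenders commonly optimize BPR, and that LIRDRec derives item representations directly from multimodal features rather than from free ID embeddings. A minimal sketch of that combination, with an assumed two-layer projection head, follows; it is illustrative only, not LIRDRec's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectMMRecommender(nn.Module):
    """BPR scorer whose item vectors are projected multimodal features,
    with no free per-item ID embedding table (illustrative only)."""
    def __init__(self, n_users: int, mm_dim: int, d: int):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, d)
        self.item_proj = nn.Sequential(
            nn.Linear(mm_dim, d), nn.ReLU(), nn.Linear(d, d))

    def bpr_loss(self, users: torch.LongTensor,
                 pos_feats: torch.Tensor, neg_feats: torch.Tensor) -> torch.Tensor:
        u = self.user_emb(users)                        # (batch, d)
        pos = (u * self.item_proj(pos_feats)).sum(-1)   # scores of observed items
        neg = (u * self.item_proj(neg_feats)).sum(-1)   # scores of sampled negatives
        # BPR: push observed items above unobserved ones.
        return -F.logsigmoid(pos - neg).mean()
```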
- LIBER: Lifelong User Behavior Modeling Based on Large Language Models [42.045535303737694]
We propose Lifelong User Behavior Modeling (LIBER) based on large language models. LIBER has been deployed on Huawei's music recommendation service, improving users' play count and play time by 3.01% and 7.69%, respectively.
arXiv Detail & Related papers (2024-11-22T03:43:41Z)
- LLM-based Bi-level Multi-interest Learning Framework for Sequential Recommendation [54.396000434574454]
We propose a novel multi-interest SR framework combining implicit behavioral and explicit semantic perspectives. It includes two modules: the Implicit Behavioral Interest Module and the Explicit Semantic Interest Module. Experiments on four real-world datasets validate the framework's effectiveness and practicality.
arXiv Detail & Related papers (2024-11-14T13:00:23Z)
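The two modules named above suggest fusing an implicit, behavior-based interest vector with an explicit, LLM-derived semantic one. A small gated-fusion sketch is shown below; the gating design is an assumption for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class BiLevelInterestFusion(nn.Module):
    """Gated fusion of an implicit (behavior-based) interest vector and an
    explicit (LLM-derived semantic) one; the gate decides, per dimension,
    how much each view contributes."""
    def __init__(self, d: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())

    def forward(self, implicit: torch.Tensor, explicit: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([implicit, explicit], dim=-1))  # (batch, d) in (0, 1)
        return g * implicit + (1 - g) * explicit
```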
- Improved Diversity-Promoting Collaborative Metric Learning for Recommendation [127.08043409083687]
Collaborative Metric Learning (CML) has recently emerged as a popular method in recommendation systems. This paper focuses on a challenging scenario where a user has multiple categories of interests. We propose a novel method called Diversity-Promoting Collaborative Metric Learning (DPCML).
arXiv Detail & Related papers (2024-09-02T07:44:48Z)
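CML scores a user-item pair by embedding distance; to serve a user with multiple interest categories, a diversity-promoting variant can maintain several embeddings per user and score an item against the closest one. The sketch below illustrates that idea with a hinge loss; the number of interest vectors and the min-distance scoring are assumptions about DPCML's design, not its published details.

```python
import torch
import torch.nn as nn

class MultiInterestCML(nn.Module):
    """Metric-learning scorer with C embeddings per user; an item is scored by
    its squared distance to the *closest* user interest vector."""
    def __init__(self, n_users: int, n_items: int, d: int, c: int = 3):
        super().__init__()
        self.c = c
        self.user_emb = nn.Embedding(n_users * c, d)  # C interest vectors per user
        self.item_emb = nn.Embedding(n_items, d)

    def min_dist(self, users: torch.LongTensor, items: torch.LongTensor) -> torch.Tensor:
        idx = users.unsqueeze(1) * self.c + torch.arange(self.c)  # (batch, C)
        u = self.user_emb(idx)                                    # (batch, C, d)
        i = self.item_emb(items).unsqueeze(1)                     # (batch, 1, d)
        return ((u - i) ** 2).sum(-1).min(dim=1).values           # closest interest

    def hinge_loss(self, users, pos_items, neg_items, margin: float = 1.0) -> torch.Tensor:
        # Positives must end up closer than negatives by at least `margin`.
        return torch.relu(
            self.min_dist(users, pos_items) - self.min_dist(users, neg_items) + margin
        ).mean()
```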
- Harnessing Multimodal Large Language Models for Multimodal Sequential Recommendation [21.281471662696372]
We propose the Multimodal Large Language Model-enhanced Multimodal Sequential Recommendation (MLLM-MSR) model. To capture dynamic user preferences, we design a two-stage user preference summarization method. We then employ a recurrent user preference summarization generation paradigm to capture the dynamic changes in user preferences.
arXiv Detail & Related papers (2024-08-19T04:44:32Z)
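A recurrent preference-summarization paradigm can be pictured as feeding the interaction history to an MLLM chunk by chunk, with each prompt carrying the previous summary forward. The sketch below shows that loop; `query_mllm` and the prompt wording are hypothetical placeholders, not the paper's prompts.

```python
def query_mllm(prompt: str) -> str:
    """Stub standing in for a real multimodal-LLM call (e.g., an API request)."""
    return f"[summary derived from: {prompt[:40]}...]"

def summarize_preferences(interactions: list[str], chunk_size: int = 5) -> str:
    """Recurrently summarize a user's interaction history, chunk by chunk,
    carrying the previous summary into each new prompt."""
    summary = "No prior preference information."
    for start in range(0, len(interactions), chunk_size):
        chunk = interactions[start:start + chunk_size]
        prompt = (
            f"Previous preference summary: {summary}\n"
            f"Recent items (text and image descriptions): {chunk}\n"
            "Update the summary of this user's preferences."
        )
        summary = query_mllm(prompt)
    return summary
```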
- NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks. Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored. We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
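Integrating an LLM with a vision encoder for item-to-item embeddings can be done, at its simplest, by projecting vision features into the language model's hidden space and fusing them with the pooled text representation. The toy module below sketches that late fusion; both the linear connector and the additive fusion are assumptions, not NoteLLM-2's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalItemEmbedder(nn.Module):
    """Toy late fusion for item-to-item retrieval: project a vision encoder's
    pooled output into the language model's hidden space, add it to the
    pooled text representation, then L2-normalize for cosine retrieval."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.connector = nn.Linear(vision_dim, lm_dim)  # vision -> LM hidden space

    def forward(self, text_hidden: torch.Tensor, vision_feat: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, lm_dim) pooled LM output for the item's text
        # vision_feat: (batch, vision_dim) pooled vision-encoder output
        fused = text_hidden + self.connector(vision_feat)
        return F.normalize(fused, dim=-1)  # unit-norm embeddings for I2I matching
```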
- MMGRec: Multimodal Generative Recommendation with Transformer Model [81.61896141495144]
MMGRec aims to introduce a generative paradigm into multimodal recommendation. We first devise a hierarchical quantization method, Graph CF-RQVAE, to assign a Rec-ID to each item from its multimodal information. We then train a Transformer-based recommender to generate the Rec-IDs of user-preferred items based on historical interaction sequences.
arXiv Detail & Related papers (2024-04-25T12:11:27Z)
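Hierarchical quantization of the Rec-ID flavor can be illustrated with plain residual quantization: each codebook level quantizes the residual left by the previous level, yielding a short code tuple per item. The NumPy sketch below uses random codebooks as stand-ins for trained Graph CF-RQVAE codebooks; it shows the assignment step only, not the paper's training procedure.

```python
import numpy as np

def residual_quantize(x: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Assign an item a hierarchical code tuple (a 'Rec-ID') by residual
    quantization: each level quantizes what the previous levels missed.

    x: (d,) item feature; codebooks: list of (K, d) code arrays.
    """
    residual = x.copy()
    code = []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))  # nearest code
        code.append(idx)
        residual = residual - cb[idx]  # pass the leftover to the next level
    return code

# Random codebooks stand in for trained ones: 3 levels of 256 codes each.
rng = np.random.default_rng(0)
books = [rng.normal(size=(256, 64)) for _ in range(3)]
rec_id = residual_quantize(rng.normal(size=64), books)  # e.g., [k1, k2, k3]
```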