Related papers: Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation

URL: http://arxiv.org/abs/2412.18176v2
Date: Mon, 30 Dec 2024 09:24:34 GMT
Title: Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
Authors: Yucong Luo, Qitao Qin, Hao Zhang, Mingyue Cheng, Ruiran Yan, Kefan Wang, Jie Ouyang,
Abstract summary: Molar is a sequential recommendation framework that integrates multiple content modalities with ID information to capture collaborative signals effectively. By seamlessly combining multimodal content with collaborative filtering insights, Molar captures both user interests and contextual semantics, leading to superior recommendation accuracy.
Score: 4.518104756199573
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sequential recommendation (SR) systems have evolved significantly over the past decade, transitioning from traditional collaborative filtering to deep learning approaches and, more recently, to large language models (LLMs). While the adoption of LLMs has driven substantial advancements, these models inherently lack collaborative filtering information: they rely primarily on textual content and neglect other modalities, and thus fail to achieve optimal recommendation performance. To address this limitation, we propose Molar, a Multimodal large language sequential recommendation framework that integrates multiple content modalities with ID information to capture collaborative signals effectively. Molar employs an MLLM to generate unified item representations from both textual and non-textual data, facilitating comprehensive multimodal modeling and enriching item embeddings. Additionally, it incorporates collaborative filtering signals through a post-alignment mechanism, which aligns user representations from content-based and ID-based models, ensuring precise personalization and robust performance. By seamlessly combining multimodal content with collaborative filtering insights, Molar captures both user interests and contextual semantics, leading to superior recommendation accuracy. Extensive experiments validate that Molar significantly outperforms traditional and LLM-based baselines, highlighting its strength in utilizing multimodal data and collaborative signals for sequential recommendation tasks. The source code is available at https://anonymous.4open.science/r/Molar-8B06/.
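
Illustrative sketch: the abstract says the post-alignment mechanism aligns user representations from a content-based model and an ID-based model, but it does not state the loss. One common way to align two embedding spaces is a symmetric contrastive (InfoNCE-style) objective, sketched below in plain Python. The function names, the temperature value, and the choice of InfoNCE itself are assumptions made for illustration and are not taken from the paper or its code.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to positives[i] within the batch."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of this user to every user in the other view.
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature for other in p]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(a)


def alignment_loss(content_users, id_users, temperature=0.1):
    """Hypothetical symmetric loss pulling the two views of each user together."""
    return 0.5 * (
        info_nce(content_users, id_users, temperature)
        + info_nce(id_users, content_users, temperature)
    )


# Toy batch: 8 users with 16-dimensional embeddings (made-up data).
rng = random.Random(0)
id_users = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
# "Aligned" content-based embeddings are small perturbations of the ID-based ones.
aligned = [[x + 0.01 * rng.gauss(0, 1) for x in row] for row in id_users]
# "Unrelated" content-based embeddings share nothing with the ID-based ones.
unrelated = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]

loss_aligned = alignment_loss(aligned, id_users)
loss_unrelated = alignment_loss(unrelated, id_users)
```

In this toy setup the loss is lower when the two views of each user agree, which is the behaviour an alignment objective of this kind is meant to reward.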

Related papers

DMESR: Dual-view MLLM-based Enhancing Framework for Multimodal Sequential Recommendation [13.114773060703891]
We propose a Dual-view MLLM-based Enhancing framework for multimodal Sequential Recommendation (DMESR). For the misalignment issue, we employ a contrastive learning mechanism to align the cross-modal semantic representations generated by MLLMs. For the loss of fine-grained semantics, we introduce a cross-attention fusion module that integrates the coarse-grained semantic knowledge obtained from MLLMs with the fine-grained original textual semantics.
arXiv Detail & Related papers (2026-02-14T10:42:56Z)
RecGOAT: Graph Optimal Adaptive Transport for LLM-Enhanced Multimodal Recommendation with Dual Semantic Alignment [23.738860191046538]
We propose RecGOAT, a novel yet simple dual semantic alignment framework for multimodal recommendation. We show that RecGOAT achieves state-of-the-art performance, empirically validating our theoretical insights.
arXiv Detail & Related papers (2026-01-31T11:58:38Z)
MMSRARec: Summarization and Retrieval Augumented Sequential Recommendation Based on Multimodal Large Language Model [18.920729109005435]
This paper proposes MultiModal Summarization-and-Retrieval-Augmented Sequential Recommendation. We first employ MLLM to summarize items into concise keywords and fine-tune the model using rewards that incorporate summary length, information loss, and reconstruction difficulty. Inspired by retrieval-augmented generation, we then transform collaborative signals into corresponding keywords and integrate them as supplementary context.
arXiv Detail & Related papers (2025-12-24T03:44:25Z)
Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation [9.37169920239321]
Multimodal Recommender Systems aim to improve recommendation accuracy by integrating heterogeneous content. While effective, it remains unclear whether their gains stem from true multimodal understanding or increased model complexity. This work investigates the role of multimodal item embeddings, emphasizing the semantic informativeness of the representations.
arXiv Detail & Related papers (2025-08-06T15:53:58Z)
Learning Item Representations Directly from Multimodal Features for Effective Recommendation [51.49251689107541]
Multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations. We propose a novel model (i.e., LIRDRec) that learns item representations directly from multimodal features to augment recommendation performance.
arXiv Detail & Related papers (2025-05-08T05:42:22Z)
HistLLM: A Unified Framework for LLM-Based Multimodal Recommendation with User History Encoding and Compression [33.34435467588446]
HistLLM is an innovative framework that integrates textual and visual features through a User History Encoding Module (UHEM), compressing user history interactions into a single token representation. Extensive experiments demonstrate the effectiveness and efficiency of our proposed mechanism.
arXiv Detail & Related papers (2025-04-14T12:01:11Z)
Distilling Transitional Pattern to Large Language Models for Multimodal Session-based Recommendation [67.84581846180458]
Session-based recommendation (SBR) predicts the next item based on anonymous sessions. Recent Multimodal SBR methods utilize simplistic pre-trained models for modality learning but have limitations in semantic richness. We propose a multimodal LLM-enhanced framework TPAD, which extends a distillation paradigm to decouple and align transitional patterns for promoting MSBR.
arXiv Detail & Related papers (2025-04-13T07:49:08Z)
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization process to enhance the multimodal reasoning capabilities of MLLMs. We develop a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B.
arXiv Detail & Related papers (2024-11-15T18:59:27Z)
LLM-based Bi-level Multi-interest Learning Framework for Sequential Recommendation [54.396000434574454]
We propose a novel multi-interest SR framework combining implicit behavioral and explicit semantic perspectives. It includes two modules: the Implicit Behavioral Interest Module and the Explicit Semantic Interest Module. Experiments on four real-world datasets validate the framework's effectiveness and practicality.
arXiv Detail & Related papers (2024-11-14T13:00:23Z)
Laser: Parameter-Efficient LLM Bi-Tuning for Sequential Recommendation with Collaborative Information [76.62949982303532]
We propose a parameter-efficient Large Language Model Bi-Tuning framework for sequential recommendation with collaborative information (Laser). In Laser, the prefix is utilized to incorporate user-item collaborative information and adapt the LLM to the recommendation task, while the suffix converts the output embeddings of the LLM from the language space to the recommendation space for the follow-up item recommendation. M-Former is a lightweight MoE-based querying transformer that uses a set of query experts to integrate diverse user-specific collaborative information encoded by frozen ID-based sequential recommender systems.
arXiv Detail & Related papers (2024-09-03T04:55:03Z)
Harnessing Multimodal Large Language Models for Multimodal Sequential Recommendation [21.281471662696372]
We propose the Multimodal Large Language Model-enhanced Multimodal Sequential Recommendation (MLLM-MSR) model. To capture the dynamic user preference, we design a two-stage user preference summarization method. We then employ a recurrent user preference summarization generation paradigm to capture the dynamic changes in user preferences.
arXiv Detail & Related papers (2024-08-19T04:44:32Z)
Beyond Inter-Item Relations: Dynamic Adaption for Enhancing LLM-Based Sequential Recommendation [83.87767101732351]
Sequential recommender systems (SRS) predict the next items that users may prefer based on user historical interaction sequences. Inspired by the rise of large language models (LLMs) in various AI applications, there is a surge of work on LLM-based SRS. We propose DARec, a sequential recommendation model built on top of coarse-grained adaption for capturing inter-item relations.
arXiv Detail & Related papers (2024-08-14T10:03:40Z)
MMREC: LLM Based Multi-Modal Recommender System [2.3113916776957635]
This paper presents a novel approach to enhancing recommender systems by leveraging Large Language Models (LLMs) and deep learning techniques. The proposed framework aims to improve the accuracy and relevance of recommendations by incorporating multi-modal information processing and by the use of unified latent space representation.
arXiv Detail & Related papers (2024-08-08T04:31:29Z)
UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia. Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale. We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z)
Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach. Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens. We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z)
NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks. Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored. We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.