Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations
- URL: http://arxiv.org/abs/2410.12228v1
- Date: Wed, 16 Oct 2024 04:44:15 GMT
- Title: Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations
- Authors: Luyi Ma, Xiaohan Li, Zezhong Fan, Jianpeng Xu, Jason Cho, Praveen Kanumala, Kaushiki Nag, Sushant Kumar, Kannan Achan
- Abstract summary: This paper introduces a novel framework for multi-behavior recommendations that leverages the fusion of three modalities: visual, textual, and graph data.
Our proposed model, Triple Modality Fusion (TMF), utilizes the power of large language models (LLMs) to align and integrate these three modalities.
Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy.
- Score: 12.154043062308201
- License:
- Abstract: Integrating diverse data modalities is crucial for enhancing the performance of personalized recommendation systems. Traditional models, which often rely on singular data sources, lack the depth needed to accurately capture the multifaceted nature of item features and user behaviors. This paper introduces a novel framework for multi-behavior recommendations that leverages the fusion of three modalities, visual, textual, and graph data, through alignment with large language models (LLMs). By incorporating visual information, we capture contextual and aesthetic item characteristics; textual data provides detailed insights into user interests and item features; and graph data elucidates relationships within the item-behavior heterogeneous graphs. Our proposed model, Triple Modality Fusion (TMF), utilizes the power of LLMs to align and integrate these three modalities, achieving a comprehensive representation of user behaviors. The LLM models the user's interactions, including behaviors and item features, in natural language. Initially, the LLM is warmed up using only natural-language prompts. We then devise a modality fusion module based on cross-attention and self-attention mechanisms to integrate embeddings from the different modality encoders into the same embedding space and incorporate them into the LLM. Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy, and further ablation studies validate our model design and the benefits of TMF.
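As a rough illustration of the modality fusion module described in the abstract, the sketch below projects visual, textual, and graph embeddings into a shared space, applies cross-attention followed by self-attention, and returns tokens that could be prepended to the LLM's prompt embeddings. Layer sizes, module names, and the exact attention wiring are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TripleModalityFusionSketch(nn.Module):
    """Hypothetical fusion of visual, textual, and graph item embeddings
    into an LLM embedding space via cross- and self-attention."""

    def __init__(self, vis_dim=768, txt_dim=768, graph_dim=128, llm_dim=4096, n_heads=8):
        super().__init__()
        # Project each modality into the LLM embedding dimension.
        self.vis_proj = nn.Linear(vis_dim, llm_dim)
        self.txt_proj = nn.Linear(txt_dim, llm_dim)
        self.graph_proj = nn.Linear(graph_dim, llm_dim)
        # Cross-attention: textual tokens attend to the visual + graph context.
        self.cross_attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)
        # Self-attention refines the fused sequence.
        self.self_attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, vis, txt, graph):
        # vis: (B, Nv, vis_dim), txt: (B, Nt, txt_dim), graph: (B, Ng, graph_dim)
        v, t, g = self.vis_proj(vis), self.txt_proj(txt), self.graph_proj(graph)
        ctx = torch.cat([v, g], dim=1)                       # visual + graph context
        fused, _ = self.cross_attn(query=t, key=ctx, value=ctx)
        fused, _ = self.self_attn(fused, fused, fused)
        # The output lives in the LLM embedding space and could be
        # concatenated with the prompt's token embeddings.
        return self.norm(fused)
```

In this reading, the fused tokens act as soft prompts: they sit alongside the natural-language description of the user's behaviors, so the warmed-up LLM consumes all three modalities in a single sequence.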
Related papers
- LLM-assisted Explicit and Implicit Multi-interest Learning Framework for Sequential Recommendation [50.98046887582194]
We propose an explicit and implicit multi-interest learning framework to model user interests on two levels: behavior and semantics.
The proposed EIMF framework effectively and efficiently combines small models with an LLM to improve the accuracy of multi-interest modeling.
arXiv Detail & Related papers (2024-11-14T13:00:23Z)
- UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia.
Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale.
We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z)
- DiffMM: Multi-Modal Diffusion Model for Recommendation [19.43775593283657]
We propose a novel multi-modal graph diffusion model for recommendation called DiffMM.
Our framework integrates a modality-aware graph diffusion model with a cross-modal contrastive learning paradigm to improve modality-aware user representation learning.
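For reference, a cross-modal contrastive objective of the kind mentioned here can be written as a symmetric InfoNCE loss between two modality-specific user representations; this is a generic sketch with an assumed temperature and pairing, not DiffMM's actual code.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """Symmetric InfoNCE between two modality views of the same users.
    z_a, z_b: (B, d) user representations; row i of each tensor is assumed
    to describe the same user (positive pair), other rows act as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                  # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```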
arXiv Detail & Related papers (2024-06-17T17:35:54Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which maps visual features to probability distributions over the Large Multi-modal Model's (LMM's) vocabulary.
We further explore the distribution of visual features in the LMM's semantic space and the possibility of using text embeddings to represent visual information.
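The mapping from visual features to a distribution over the LMM's vocabulary can be sketched as a projection followed by a softmax over similarities with a token-embedding matrix; the layer sizes and the separate embedding table below are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualWordHead(nn.Module):
    """Maps visual patch features to probability distributions over an
    LMM vocabulary, so image regions can be supervised like text tokens."""

    def __init__(self, vis_dim=1024, llm_dim=4096, vocab_size=32000):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)                # visual -> LLM space
        self.vocab_embed = nn.Embedding(vocab_size, llm_dim)   # stand-in for the LMM's token embeddings

    def forward(self, vis_feats):
        # vis_feats: (B, N, vis_dim) patch/region features
        h = self.proj(vis_feats)
        logits = h @ self.vocab_embed.weight.t()               # (B, N, vocab_size)
        return F.softmax(logits, dim=-1)                       # "visual word" distributions
```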
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- RecExplainer: Aligning Large Language Models for Explaining Recommendation Models [50.74181089742969]
Large language models (LLMs) have demonstrated remarkable intelligence in understanding, reasoning, and instruction following.
This paper presents the initial exploration of using LLMs as surrogate models to explain black-box recommender models.
To facilitate an effective alignment, we introduce three methods: behavior alignment, intention alignment, and hybrid alignment.
arXiv Detail & Related papers (2023-11-18T03:05:43Z)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
The UniMM-Chat dataset explores the complementarities of existing datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction [13.454953507205278]
Multi-Modal Relation Extraction (MMRE) aims at identifying the relation between two entities in texts that contain visual clues.
We propose a novel MMRE framework to better capture the deeper correlations of text, entity pair, and image/objects.
Our approach achieves excellent performance compared to strong competitors, even in the few-shot situation.
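As a loose illustration of gating visual evidence into a textual entity-pair representation, the sketch below uses a single sigmoid gate over concatenated features; the layer sizes and single-gate design are assumptions and do not reproduce the paper's dual-gated, prefix-tuned architecture.

```python
import torch
import torch.nn as nn

class GatedVisualFusion(nn.Module):
    """Gates how much image/object evidence flows into the textual
    entity-pair representation (illustrative single-gate variant)."""

    def __init__(self, txt_dim=768, vis_dim=768):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, txt_dim)
        self.gate = nn.Sequential(nn.Linear(txt_dim * 2, txt_dim), nn.Sigmoid())

    def forward(self, txt, vis):
        # txt: (B, txt_dim) text feature for the entity pair
        # vis: (B, vis_dim) pooled image/object feature
        v = self.vis_proj(vis)
        g = self.gate(torch.cat([txt, v], dim=-1))   # per-dimension gate in (0, 1)
        return txt + g * v                           # gated residual fusion
```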
arXiv Detail & Related papers (2023-06-19T15:31:34Z)