Structurally Refined Graph Transformer for Multimodal Recommendation
- URL: http://arxiv.org/abs/2511.00584v1
- Date: Sat, 01 Nov 2025 15:18:00 GMT
- Title: Structurally Refined Graph Transformer for Multimodal Recommendation
- Authors: Ke Shi, Yan Zhang, Miao Zhang, Lifan Chen, Jiali Yi, Kui Xiao, Xiaoju Hou, Zhifei Li
- Abstract summary: We present SRGFormer, a structurally optimized multimodal recommendation model. By modifying the transformer for better integration into our model, we capture the overall behavior patterns of users. Then, we enhance structural information by embedding multimodal information into a hypergraph structure to aid in learning the local structures between users and items.
- Score: 13.296555757708298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal recommendation systems utilize various types of information, including images and text, to enhance the effectiveness of recommendations. The key challenge is predicting user purchasing behavior from the available data. Current recommendation models prioritize extracting multimodal information while neglecting the distinction between redundant and valuable data. They also rely heavily on a single semantic framework (e.g., local or global semantics), resulting in an incomplete or biased representation of user preferences, particularly those less expressed in prior interactions. Furthermore, these approaches fail to capture the complex interactions between users and items, limiting the model's ability to meet the needs of diverse users. To address these challenges, we present SRGFormer, a structurally optimized multimodal recommendation model. By modifying the transformer for better integration into our model, we capture the overall behavior patterns of users. Then, we enhance structural information by embedding multimodal information into a hypergraph structure to aid in learning the local structures between users and items. Meanwhile, applying self-supervised tasks to user-item collaborative signals enhances the integration of multimodal information, thereby revealing the representational features inherent to each data modality. Extensive experiments on three public datasets reveal that SRGFormer surpasses previous benchmark models, achieving an average performance improvement of 4.47 percent on the Sports dataset. The code is publicly available online.
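Two of the abstract's structural ideas lend themselves to a concrete sketch: message passing over a hypergraph that links users and items through multimodal features, and a self-supervised contrastive task over two views of the same entities. The PyTorch sketch below is a minimal, hypothetical illustration of those two components under assumed shapes and names (`HypergraphConv`, `ssl_loss`, and the toy incidence matrix `H` are all illustrative choices), not the authors' published implementation.

```python
# Hedged illustrative sketch (not SRGFormer's released code): one hypergraph
# message-passing step plus an InfoNCE-style self-supervised loss, loosely
# mirroring the hypergraph and self-supervision components the abstract names.
import torch
import torch.nn.functional as F


class HypergraphConv(torch.nn.Module):
    """One hypergraph convolution: node -> hyperedge -> node aggregation."""

    def __init__(self, dim: int):
        super().__init__()
        self.lin = torch.nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        # H: (num_nodes, num_edges) incidence matrix; normalize by degree.
        edge_deg = H.sum(dim=0).clamp(min=1.0)           # hyperedge degrees
        node_deg = H.sum(dim=1).clamp(min=1.0)           # node degrees
        edge_msg = (H.t() @ x) / edge_deg.unsqueeze(1)   # gather nodes per edge
        out = (H @ edge_msg) / node_deg.unsqueeze(1)     # scatter back to nodes
        return F.relu(self.lin(out))


def ssl_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """InfoNCE between two views of the same entities (e.g., graph vs. modality)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                # pairwise similarities
    labels = torch.arange(z1.size(0))         # positives sit on the diagonal
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    torch.manual_seed(0)
    num_nodes, num_edges, dim = 8, 4, 16
    H = (torch.rand(num_nodes, num_edges) > 0.5).float()  # toy incidence matrix
    x = torch.randn(num_nodes, dim)           # e.g., fused image/text embeddings
    conv = HypergraphConv(dim)
    z_graph = conv(x, H)                      # local-structure view
    z_other = x @ torch.randn(dim, dim)       # stand-in for a second view
    print(f"SSL loss: {ssl_loss(z_graph, z_other).item():.4f}")
```

In a full model, the second view contrasted in `ssl_loss` would come from another branch (for instance, the transformer over user behavior) rather than the random projection used here for demonstration.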
Related papers
- CAMMSR: Category-Guided Attentive Mixture of Experts for Multimodal Sequential Recommendation [23.478610632707728]
We propose a Category-guided Attentive Mixture of Experts model for Multimodal Sequential Recommendation. At its core, CAMMSR introduces a category-guided attentive mixture of experts module, which learns specialized item representations from multiple perspectives. Experiments on four public datasets demonstrate that CAMMSR consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-04T17:39:35Z) - DReX: An Explainable Deep Learning-based Multimodal Recommendation Framework [2.631846982371029]
DReX is a unified multimodal recommendation framework that incrementally refines user and item representations by leveraging interaction-level features from multimodal feedback. We evaluate the performance of the proposed approach on three real-world datasets containing reviews and ratings as interaction modalities.
arXiv Detail & Related papers (2026-02-23T10:52:20Z) - Multimodal Representation-disentangled Information Bottleneck for Multimodal Recommendation [36.338586087343806]
We propose a novel framework, the Multimodal Representation-disentangled Information Bottleneck (MRdIB). Concretely, we first employ a Multimodal Information Bottleneck to compress the input representations. Then, we decompose the information based on its relationship with the recommendation target into unique, redundant, and synergistic components.
arXiv Detail & Related papers (2025-09-24T15:18:32Z) - Learning Item Representations Directly from Multimodal Features for Effective Recommendation [51.49251689107541]
Multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations. We propose a novel model (i.e., LIRDRec) that learns item representations directly from multimodal features to augment recommendation performance.
arXiv Detail & Related papers (2025-05-08T05:42:22Z) - Multifaceted User Modeling in Recommendation: A Federated Foundation Models Approach [28.721903315405353]
Multifaceted user modeling aims to uncover fine-grained patterns and learn representations from user data. Recent studies on foundation model-based recommendation have emphasized the Transformer architecture's remarkable ability to capture complex, non-linear user-item interaction relationships. We propose a novel Transformer layer designed specifically for recommendation, using the self-attention mechanism to capture sequential user-item interaction patterns.
arXiv Detail & Related papers (2024-12-22T11:00:00Z) - Multimodal Difference Learning for Sequential Recommendation [5.243083216855681]
We argue that user interests and item relationships vary across different modalities. We propose a novel Multimodal Difference Learning framework for Sequential Recommendation, MDSRec. Results on five real-world datasets demonstrate the superiority of MDSRec over state-of-the-art baselines.
arXiv Detail & Related papers (2024-12-11T05:08:19Z) - Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations [13.878297630442674]
This paper introduces a novel framework for multi-behavior recommendations, leveraging triple-modality fusion. Our proposed model, called Triple Modality Fusion (TMF), utilizes the power of large language models (LLMs) to align and integrate these three modalities. Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy.
arXiv Detail & Related papers (2024-10-16T04:44:15Z) - A Unified Graph Transformer for Overcoming Isolations in Multi-modal Recommendation [9.720586396359906]
We argue that existing multi-modal recommender systems typically use isolated processes for both feature extraction and modality modelling.
We propose a novel model, called Unified Multi-modal Graph Transformer (UGT), which leverages a multi-way transformer to extract aligned multi-modal features.
We show that the UGT model can achieve significant effectiveness gains, especially when jointly optimised with the commonly-used multi-modal recommendation losses.
arXiv Detail & Related papers (2024-07-29T11:04:31Z) - Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond [87.1712108247199]
Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP).
We develop a generic and personalized generative framework that can handle a wide range of personalized needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
arXiv Detail & Related papers (2024-03-15T20:21:31Z) - BiVRec: Bidirectional View-based Multimodal Sequential Recommendation [55.87443627659778]
We propose an innovative framework, BivRec, that jointly trains the recommendation tasks in both ID and multimodal views.
BivRec achieves state-of-the-art performance on five datasets and showcases various practical advantages.
arXiv Detail & Related papers (2024-02-27T09:10:41Z) - Ada-Retrieval: An Adaptive Multi-Round Retrieval Paradigm for Sequential Recommendations [50.03560306423678]
We propose Ada-Retrieval, an adaptive multi-round retrieval paradigm for recommender systems.
Ada-Retrieval iteratively refines user representations to better capture potential candidates in the full item space.
arXiv Detail & Related papers (2024-01-12T15:26:40Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Knowledge-Enhanced Hierarchical Graph Transformer Network for Multi-Behavior Recommendation [56.12499090935242]
This work proposes a Knowledge-Enhanced Hierarchical Graph Transformer Network (KHGT) to investigate multi-typed interactive patterns between users and items in recommender systems.
KHGT is built upon a graph-structured neural architecture to capture type-specific behavior characteristics.
We show that KHGT consistently outperforms many state-of-the-art recommendation methods across various evaluation settings.
arXiv Detail & Related papers (2021-10-08T09:44:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.