A Unified Graph Transformer for Overcoming Isolations in Multi-modal Recommendation
- URL: http://arxiv.org/abs/2407.19886v1
- Date: Mon, 29 Jul 2024 11:04:31 GMT
- Title: A Unified Graph Transformer for Overcoming Isolations in Multi-modal Recommendation
- Authors: Zixuan Yi, Iadh Ounis,
- Abstract summary: We argue that existing multi-modal recommender systems typically use isolated processes for both feature extraction and modality modelling.
We propose a novel model, called Unified Multi-modal Graph Transformer (UGT), which leverages a multi-way transformer to extract aligned multi-modal features.
We show that the UGT model can achieve significant effectiveness gains, especially when jointly optimised with the commonly-used multi-modal recommendation losses.
- Score: 9.720586396359906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid development of online multimedia services, especially in e-commerce platforms, there is a pressing need for personalised recommendation systems that can effectively encode the diverse multi-modal content associated with each item. However, we argue that existing multi-modal recommender systems typically use isolated processes for both feature extraction and modality modelling. Such isolated processes can harm the recommendation performance. Firstly, an isolated extraction process underestimates the importance of effective feature extraction in multi-modal recommendations, potentially incorporating non-relevant information, which is harmful to item representations. Second, an isolated modality modelling process produces disjointed embeddings for item modalities due to the individual processing of each modality, which leads to a suboptimal fusion of user/item representations for effective user preferences prediction. We hypothesise that the use of a unified model for addressing both aforementioned isolated processes will enable the consistent extraction and cohesive fusion of joint multi-modal features, thereby enhancing the effectiveness of multi-modal recommender systems. In this paper, we propose a novel model, called Unified Multi-modal Graph Transformer (UGT), which firstly leverages a multi-way transformer to extract aligned multi-modal features from raw data for top-k recommendation. Subsequently, we build a unified graph neural network in our UGT model to jointly fuse the user/item representations with their corresponding multi-modal features. Using the graph transformer architecture of our UGT model, we show that the UGT model can achieve significant effectiveness gains, especially when jointly optimised with the commonly-used multi-modal recommendation losses.
Related papers
- Attention-based sequential recommendation system using multimodal data [8.110978727364397]
We propose an attention-based sequential recommendation method that employs multimodal data of items such as images, texts, and categories.
The experimental results obtained from the Amazon datasets show that the proposed method outperforms those of conventional sequential recommendation systems.
arXiv Detail & Related papers (2024-05-28T08:41:05Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantics.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Multimodal Fusion with Pre-Trained Model Features in Affective Behaviour Analysis In-the-wild [37.32217405723552]
We present an approach for addressing the task of Expression (Expr) Recognition and Valence-Arousal (VA) Estimation.
We evaluate the Aff-Wild2 database using pre-trained models, then extract the final hidden layers of the models as features.
Following preprocessing and or convolution to align the extracted features, different models are employed for modal fusion.
arXiv Detail & Related papers (2024-03-22T09:00:24Z) - Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond [87.1712108247199]
Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP)
We develop a generic and personalization generative framework, that can handle a wide range of personalized needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
arXiv Detail & Related papers (2024-03-15T20:21:31Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - MISSRec: Pre-training and Transferring Multi-modal Interest-aware
Sequence Representation for Recommendation [61.45986275328629]
We propose MISSRec, a multi-modal pre-training and transfer learning framework for sequential recommendation.
On the user side, we design a Transformer-based encoder-decoder model, where the contextual encoder learns to capture the sequence-level multi-modal user interests.
On the candidate item side, we adopt a dynamic fusion module to produce user-adaptive item representation.
arXiv Detail & Related papers (2023-08-22T04:06:56Z) - Multimodal E-Commerce Product Classification Using Hierarchical Fusion [0.0]
The proposed method significantly outperformed the unimodal models' performance and the reported performance of similar models on our specific task.
We did experiments with multiple fusing techniques and found, that the best performing technique to combine the individual embedding of the unimodal network is based on combining concatenation and averaging the feature vectors.
arXiv Detail & Related papers (2022-07-07T14:04:42Z) - Abstractive Sentence Summarization with Guidance of Selective Multimodal
Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generalism of proposed mhsf model with the pre-trained+fine-tuning and fresh training strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z) - Federated Multi-view Matrix Factorization for Personalized
Recommendations [53.74747022749739]
We introduce the federated multi-view matrix factorization method that extends the federated learning framework to matrix factorization with multiple data sources.
Our method is able to learn the multi-view model without transferring the user's personal data to a central server.
arXiv Detail & Related papers (2020-04-08T21:07:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.