MM-GEF: Multi-modal representation meet collaborative filtering
- URL: http://arxiv.org/abs/2308.07222v2
- Date: Wed, 14 Aug 2024 08:38:00 GMT
- Title: MM-GEF: Multi-modal representation meet collaborative filtering
- Authors: Hao Wu, Alejandro Ariza-Casabona, Bartłomiej Twardowski, Tri Kurniawan Wijaya,
- Abstract summary: We propose a graph-based item structure enhancement method MM-GEF: Multi-Modal recommendation with Graph Early-Fusion.
MM-GEF learns refined item representations by injecting structural information obtained from both multi-modal and collaborative signals.
- Score: 43.88159639990081
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In modern e-commerce, item content features in various modalities offer accurate yet comprehensive information to recommender systems. The majority of previous work either focuses on learning effective item representations while modelling user-item interactions, or on exploring item-item relationships by analysing multi-modal features. Those methods, however, fail to incorporate the collaborative item-user-item relationships into the multi-modal feature-based item structure. In this work, we propose a graph-based item structure enhancement method MM-GEF: Multi-Modal recommendation with Graph Early-Fusion, which effectively combines the latent item structure underlying multi-modal contents with the collaborative signals. Instead of processing the content features in different modalities separately, we show that the early-fusion of multi-modal features provides significant improvement. MM-GEF learns refined item representations by injecting structural information obtained from both multi-modal and collaborative signals. Through extensive experiments on four publicly available datasets, we demonstrate systematic improvements of our method over state-of-the-art multi-modal recommendation methods.
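The abstract describes the method only at a high level. As a rough illustration, here is a minimal sketch of the two ingredients it names: an item-item graph built from early-fused multi-modal features, and a collaborative item-user-item graph derived from interactions, mixed and then used for one graph-convolution step. Every concrete choice here is an assumption, not the authors' implementation: early fusion is taken to mean L2-normalising each modality and concatenating; the content graph keeps each item's top-k cosine neighbours (negative similarities clipped to zero); the collaborative graph counts users who interacted with both items; and the two row-normalised graphs are mixed with a hypothetical weight `alpha`. All function names are hypothetical.

```python
import math


def l2_normalize(vec):
    # Scale to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def early_fuse(modalities):
    # Assumed early fusion: normalise each modality's item vector, then concatenate.
    # modalities[m][i] is the feature vector of item i in modality m.
    n_items = len(modalities[0])
    return [sum((l2_normalize(m[i]) for m in modalities), []) for i in range(n_items)]


def knn_graph(features, k):
    # Keep each item's top-k most similar other items; negative similarities are clipped to 0.
    feats = [l2_normalize(f) for f in features]
    n = len(feats)
    graph = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = sorted(
            ((sum(x * y for x, y in zip(feats[i], feats[j])), j) for j in range(n) if j != i),
            reverse=True,
        )
        for sim, j in sims[:k]:
            graph[i][j] = max(sim, 0.0)
    return graph


def item_user_item_graph(user_items, n_items):
    # Assumed collaborative signal: number of users who interacted with both item i and item j.
    co = [[0.0] * n_items for _ in range(n_items)]
    for items in user_items:
        unique = set(items)
        for i in unique:
            for j in unique:
                if i != j:
                    co[i][j] += 1.0
    return co


def row_normalize(matrix):
    # Make each non-empty row sum to 1; all-zero rows stay all-zero.
    out = []
    for row in matrix:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def fused_item_graph(modalities, user_items, k=2, alpha=0.5):
    # Hypothetical weight alpha balances content structure against collaborative structure.
    n = len(modalities[0])
    mm = row_normalize(knn_graph(early_fuse(modalities), k))
    cf = row_normalize(item_user_item_graph(user_items, n))
    return [[alpha * mm[i][j] + (1 - alpha) * cf[i][j] for j in range(n)] for i in range(n)]


def propagate(graph, features):
    # One graph-convolution step: each item becomes a weighted mix of its neighbours' features.
    dim = len(features[0])
    return [
        [sum(graph[i][j] * features[j][d] for j in range(len(features))) for d in range(dim)]
        for i in range(len(graph))
    ]


# Toy example: 3 items, two modalities, 3 users.
image = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
user_items = [[0, 1], [1, 2], [0, 1]]
graph = fused_item_graph([image, text], user_items)
print(graph[0])  # [0.0, 1.0, 0.0]
```

Building the content graph separately per modality and averaging the graphs afterwards would instead give a late-fusion variant, which is the separate per-modality processing the abstract contrasts early fusion against.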
Related papers
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
Date: 2024-08-06T12:45:56Z
- StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers.
We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding.
Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
Date: 2024-08-02T15:41:16Z
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
Date: 2024-03-11T01:07:36Z
- MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization [93.5217515566437]
Multi-modal Product Summarization (MPS) aims to increase customers' desire to purchase by highlighting product characteristics.
Existing MPS methods can produce promising results, but they still lack end-to-end product summarization.
We propose an end-to-end multi-modal attribute-aware product summarization method (MMAPS) for generating high-quality product summaries in e-commerce.
Date: 2023-08-22T11:00:09Z
- Using Multiple Instance Learning to Build Multimodal Representations [3.354271620160378]
Image-text multimodal representation learning aligns data across modalities and enables important medical applications.
We propose a generic framework for constructing permutation-invariant score functions with many existing multimodal representation learning approaches as special cases.
Date: 2022-12-11T18:01:11Z
- Multi-modal Contrastive Representation Learning for Entity Alignment [57.92705405276161]
Multi-modal entity alignment aims to identify equivalent entities between two different multi-modal knowledge graphs.
We propose MCLEA, a Multi-modal Contrastive Learning based Entity Alignment model.
In particular, MCLEA first learns multiple individual representations from multiple modalities, and then performs contrastive learning to jointly model intra-modal and inter-modal interactions.
Date: 2022-09-02T08:59:57Z
- Multimodal E-Commerce Product Classification Using Hierarchical Fusion [0.0]
The proposed method significantly outperformed the unimodal models' performance and the reported performance of similar models on our specific task.
We experimented with multiple fusion techniques and found that the best-performing technique for combining the individual embeddings of the unimodal networks uses both concatenation and averaging of the feature vectors.
Date: 2022-07-07T14:04:42Z
- Latent Structures Mining with Contrastive Modality Fusion for Multimedia Recommendation [22.701371886522494]
We argue that the latent semantic item-item structures underlying multimodal contents could be beneficial for learning better item representations.
We devise a novel modality-aware structure learning module, which learns item-item relationships for each modality.
Date: 2021-11-01T03:37:02Z
- Mining Latent Structures for Multimedia Recommendation [46.70109406399858]
We propose a LATent sTructure mining method for multImodal reCommEndation, which we term LATTICE for brevity.
We learn item-item structures for each modality and aggregate multiple modalities to obtain latent item graphs.
Based on the learned latent graphs, we perform graph convolutions to explicitly inject high-order item affinities into item representations.
Date: 2021-04-19T03:50:24Z
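Three of the related papers above (the CLIP-guided feature-alignment model, MCLEA, and the contrastive modality-fusion work) build on contrastive learning across modalities. As a rough illustration of that shared idea only, here is a generic InfoNCE loss plus a CLIP-style symmetric variant. It assumes a batch of paired embeddings (row i of each view describes the same item or entity) and a hypothetical temperature of 0.1; it does not reproduce any of these papers' exact objectives.

```python
import math


def l2_normalize(vec):
    # Unit-length vectors make the dot product a cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(view_a, view_b, temperature=0.1):
    """Generic InfoNCE loss over a batch of paired embeddings.

    view_a[i] and view_b[i] embed the same item/entity in two modalities;
    every view_b[j] with j != i serves as a negative for view_a[i].
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature for other in b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]  # negative log-softmax probability of the true pair
    return total / len(a)


def symmetric_info_nce(view_a, view_b, temperature=0.1):
    # CLIP-style objective: average the loss in both directions (a -> b and b -> a).
    return 0.5 * (info_nce(view_a, view_b, temperature) + info_nce(view_b, view_a, temperature))


image = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]
text_shuffled = [[0.1, 0.9], [0.9, 0.1]]
print(symmetric_info_nce(image, text_aligned) < symmetric_info_nce(image, text_shuffled))  # True
```

Correctly paired embeddings yield a loss near zero, while mismatched pairs are penalised heavily; minimising this loss pulls matching cross-modal embeddings together and pushes the in-batch negatives apart.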
This list is automatically generated from the titles and abstracts of the papers in this site.