Towards Bridging the Cross-modal Semantic Gap for Multi-modal Recommendation
- URL: http://arxiv.org/abs/2407.05420v1
- Date: Sun, 07 Jul 2024 15:56:03 GMT
- Title: Towards Bridging the Cross-modal Semantic Gap for Multi-modal Recommendation
- Authors: Xinglong Wu, Anfeng Huang, Hongwei Yang, Hui He, Yu Tai, Weizhe Zhang,
- Abstract summary: Multi-modal recommendation greatly enhances the performance of recommender systems.
Most existing multi-modal recommendation models exploit multimedia information propagation processes to enrich item representations.
We propose a novel framework to bridge the semantic gap between modalities and extract fine-grained multi-view semantic information.
- Score: 12.306686291299146
- License:
- Abstract: Multi-modal recommendation greatly enhances the performance of recommender systems by modeling the auxiliary information from multi-modality contents. Most existing multi-modal recommendation models primarily exploit multimedia information propagation processes to enrich item representations and directly utilize modal-specific embedding vectors independently obtained from upstream pre-trained models. However, this might be inappropriate since the abundant task-specific semantics remain unexplored, and the cross-modality semantic gap hinders the recommendation performance. Inspired by the recent progress of the cross-modal alignment model CLIP, in this paper, we propose a novel \textbf{CLIP} \textbf{E}nhanced \textbf{R}ecommender (\textbf{CLIPER}) framework to bridge the semantic gap between modalities and extract fine-grained multi-view semantic information. Specifically, we introduce a multi-view modality-alignment approach for representation extraction and measure the semantic similarity between modalities. Furthermore, we integrate the multi-view multimedia representations into downstream recommendation models. Extensive experiments conducted on three public datasets demonstrate the consistent superiority of our model over state-of-the-art multi-modal recommendation models.
Related papers
- QARM: Quantitative Alignment Multi-Modal Recommendation at Kuaishou [23.818456863262494]
We introduce a quantitative multi-modal framework to customize the specialized and trainable multi-modal information for different downstream models.
Inspired by the two difficulties challenges in downstream tasks usage, we introduce a quantitative multi-modal framework to customize the specialized and trainable multi-modal information for different downstream models.
arXiv Detail & Related papers (2024-11-18T17:08:35Z) - A Unified Graph Transformer for Overcoming Isolations in Multi-modal Recommendation [9.720586396359906]
We argue that existing multi-modal recommender systems typically use isolated processes for both feature extraction and modality modelling.
We propose a novel model, called Unified Multi-modal Graph Transformer (UGT), which leverages a multi-way transformer to extract aligned multi-modal features.
We show that the UGT model can achieve significant effectiveness gains, especially when jointly optimised with the commonly-used multi-modal recommendation losses.
arXiv Detail & Related papers (2024-07-29T11:04:31Z) - DiffMM: Multi-Modal Diffusion Model for Recommendation [19.43775593283657]
We propose a novel multi-modal graph diffusion model for recommendation called DiffMM.
Our framework integrates a modality-aware graph diffusion model with a cross-modal contrastive learning paradigm to improve modality-aware user representation learning.
arXiv Detail & Related papers (2024-06-17T17:35:54Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantics.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - BiVRec: Bidirectional View-based Multimodal Sequential Recommendation [55.87443627659778]
We propose an innovative framework, BivRec, that jointly trains the recommendation tasks in both ID and multimodal views.
BivRec achieves state-of-the-art performance on five datasets and showcases various practical advantages.
arXiv Detail & Related papers (2024-02-27T09:10:41Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Adaptive Contrastive Learning on Multimodal Transformer for Review
Helpfulness Predictions [40.70793282367128]
We propose Multimodal Contrastive Learning for Multimodal Review Helpfulness Prediction (MRHP) problem.
In addition, we introduce Adaptive Weighting scheme for our contrastive learning approach.
Finally, we propose Multimodal Interaction module to address the unalignment nature of multimodal data.
arXiv Detail & Related papers (2022-11-07T13:05:56Z) - Support-set based Multi-modal Representation Enhancement for Video
Captioning [121.70886789958799]
We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way.
arXiv Detail & Related papers (2022-05-19T03:40:29Z) - Abstractive Sentence Summarization with Guidance of Selective Multimodal
Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generalism of proposed mhsf model with the pre-trained+fine-tuning and fresh training strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z) - Linguistic Structure Guided Context Modeling for Referring Image
Segmentation [61.701577239317785]
We propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction.
Our LSCM module builds a Dependency Parsing Tree Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence.
arXiv Detail & Related papers (2020-10-01T16:03:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.