MONET: Modality-Embracing Graph Convolutional Network and Target-Aware
Attention for Multimedia Recommendation
- URL: http://arxiv.org/abs/2312.09511v1
- Date: Fri, 15 Dec 2023 03:28:19 GMT
- Title: MONET: Modality-Embracing Graph Convolutional Network and Target-Aware
Attention for Multimedia Recommendation
- Authors: Yungi Kim, Taeri Kim, Won-Yong Shin, and Sang-Wook Kim
- Abstract summary: We focus on multimedia recommender systems using graph convolutional networks (GCNs).
Our study aims to exploit multimodal features more effectively in order to accurately capture users' preferences for items.
We propose a novel multimedia recommender system, named MONET, composed of the following two core ideas: modality-embracing GCN (MeGCN) and target-aware attention.
- Score: 21.61057660080108
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we focus on multimedia recommender systems using graph
convolutional networks (GCNs) where the multimodal features as well as
user-item interactions are employed together. Our study aims to exploit
multimodal features more effectively in order to accurately capture users'
preferences for items. To this end, we point out the following two limitations of
existing GCN-based multimedia recommender systems: (L1) although multimodal
features of interacted items by a user can reveal her preferences on items,
existing methods utilize GCN designed to focus only on capturing collaborative
signals, resulting in insufficient reflection of the multimodal features in the
final user/item embeddings; (L2) although a user decides whether to prefer the
target item by considering its multimodal features, existing methods represent
her as only a single embedding regardless of the target item's multimodal
features and then utilize her embedding to predict her preference for the
target item. To address the above issues, we propose a novel multimedia
recommender system, named MONET, composed of the following two core ideas:
modality-embracing GCN (MeGCN) and target-aware attention. Through extensive
experiments using four real-world datasets, we demonstrate i) the significant
superiority of MONET over seven state-of-the-art competitors (up to 30.32%
higher accuracy in terms of recall@20, compared to the best competitor) and ii)
the effectiveness of the two core ideas in MONET. All MONET code is available
at https://github.com/Kimyungi/MONET.
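The abstract's second core idea, target-aware attention, can be illustrated with a small sketch. This is not MONET's actual implementation (see the repository above for that). It only shows the general idea of building a different user representation per target item rather than one fixed embedding. The function names, the dot-product scoring, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def target_aware_user_embedding(interacted_item_embs, target_item_emb):
    """Hypothetical helper: weight each interacted item by its
    softmax-normalized dot-product similarity to the target item,
    then return the weighted sum as the user representation."""
    scores = [dot(e, target_item_emb) for e in interacted_item_embs]
    weights = softmax(scores)
    dim = len(target_item_emb)
    return [
        sum(w * e[d] for w, e in zip(weights, interacted_item_embs))
        for d in range(dim)
    ]


def preference_score(interacted_item_embs, target_item_emb):
    """Hypothetical scoring: inner product of the target-aware user
    embedding with the target item embedding."""
    user = target_aware_user_embedding(interacted_item_embs, target_item_emb)
    return dot(user, target_item_emb)


# Toy data (assumed): a user who interacted with two dissimilar items.
history = [[1.0, 0.0], [0.0, 1.0]]
user_for_a = target_aware_user_embedding(history, [2.0, 0.0])
user_for_b = target_aware_user_embedding(history, [0.0, 2.0])
```

In this toy setup the same interaction history yields two different user vectors: `user_for_a` leans toward the first interacted item and `user_for_b` toward the second, which is the behavior limitation (L2) says a single fixed user embedding cannot provide.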
Related papers
- Enhancing Live Broadcast Engagement: A Multi-modal Approach to Short Video Recommendations Using MMGCN and User Preferences [0.0]
This paper develops a short video recommendation system that incorporates Multi-modal Graph Convolutional Networks (MMGCN) with user preferences.
In order to provide personalized recommendations tailored to individual interests, the proposed system takes into account user interaction data, video content features, and contextual information.
Three datasets are used to evaluate the effectiveness of the system: Kwai, TikTok, and MovieLens.
arXiv Detail & Related papers (2025-06-29T04:50:52Z) - Learning Item Representations Directly from Multimodal Features for Effective Recommendation [51.49251689107541]
Multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations.
We propose a novel model (i.e., LIRDRec) that learns item representations directly from multimodal features to augment recommendation performance.
arXiv Detail & Related papers (2025-05-08T05:42:22Z) - Quadratic Interest Network for Multimodal Click-Through Rate Prediction [12.989347150912685]
Multimodal click-through rate (CTR) prediction is a key technique in industrial recommender systems.
We propose a novel model for Task 2, named Quadratic Interest Network (QIN) for Multimodal CTR Prediction.
arXiv Detail & Related papers (2025-04-24T16:08:52Z) - Less is More: Information Bottleneck Denoised Multimedia Recommendation [43.66791467993419]
We propose a denoised multimedia recommendation paradigm via the Information Bottleneck principle (IB).
IBMRec removes task-irrelevant features from both feature and item-item structure perspectives.
It is achieved by maximizing the mutual information between multimedia representation and recommendation tasks.
arXiv Detail & Related papers (2025-01-21T14:33:07Z) - Dynamic Multimodal Fusion via Meta-Learning Towards Micro-Video Recommendation [97.82707398481273]
We develop a novel meta-learning-based multimodal fusion framework called Meta Multimodal Fusion (MetaMMF).
Based on the meta information extracted from the multimodal features of the input task, MetaMMF parameterizes a neural network as the item-specific fusion function via a meta learner.
We perform extensive experiments on three benchmark datasets, demonstrating the significant improvements over several state-of-the-art multimodal recommendation models.
arXiv Detail & Related papers (2025-01-13T07:51:43Z) - MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt [60.10555128510744]
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities.
Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks.
We introduce a novel framework called MambaPro for multi-modal object ReID.
arXiv Detail & Related papers (2024-12-14T06:33:53Z) - Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - MIMNet: Multi-Interest Meta Network with Multi-Granularity Target-Guided Attention for Cross-domain Recommendation [6.7902741961967]
Cross-domain recommendation (CDR) plays a critical role in alleviating the sparsity and cold-start problem.
We propose a novel method named Multi-interest Meta Network with Multi-granularity Target-guided Attention (MIMNet) for cross-domain recommendation.
arXiv Detail & Related papers (2024-07-31T13:30:34Z) - NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks.
Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored.
We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantics.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Just Noticeable Visual Redundancy Forecasting: A Deep Multimodal-driven
Approach [11.600496805298778]
Just noticeable difference (JND) refers to the maximum visual change that human eyes cannot perceive.
In this article, we investigate the JND modeling from an end-to-end multimodal perspective, namely hmJND-Net.
arXiv Detail & Related papers (2023-03-18T09:36:59Z) - M2RNet: Multi-modal and Multi-scale Refined Network for RGB-D Salient
Object Detection [1.002712867721496]
Methods based on RGB-D often suffer from the incompatibility of multi-modal feature fusion and the insufficiency of multi-scale feature aggregation.
We propose a novel multi-modal and multi-scale refined network (M2RNet).
Three essential components are presented in this network.
arXiv Detail & Related papers (2021-09-16T12:15:40Z) - Specificity-preserving RGB-D Saliency Detection [103.3722116992476]
We propose a specificity-preserving network (SP-Net) for RGB-D saliency detection.
Two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps.
Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T14:14:22Z) - MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion
Recognition in Conversation [32.15124603618625]
We propose a new model based on multimodal fused graph convolutional network, MMGCN, in this work.
MMGCN can not only make use of multimodal dependencies effectively, but also leverage speaker information to model inter-speaker and intra-speaker dependency.
We evaluate our proposed model on two public benchmark datasets, IEMOCAP and MELD, and the results prove the effectiveness of MMGCN.
arXiv Detail & Related papers (2021-07-14T15:37:02Z) - Mining Latent Structures for Multimedia Recommendation [46.70109406399858]
We propose a LATent sTructure mining method for multImodal reCommEndation, which we term LATTICE for brevity.
We learn item-item structures for each modality and aggregate multiple modalities to obtain latent item graphs.
Based on the learned latent graphs, we perform graph convolutions to explicitly inject high-order item affinities into item representations.
arXiv Detail & Related papers (2021-04-19T03:50:24Z) - VMSMO: Learning to Generate Multimodal Summary for Video-based News
Articles [63.32111010686954]
We propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO).
The main challenge in this task is to jointly model the temporal dependency of video with semantic meaning of article.
We propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and multimodal generator.
arXiv Detail & Related papers (2020-10-12T02:19:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.