Latent Structures Mining with Contrastive Modality Fusion for Multimedia
Recommendation
- URL: http://arxiv.org/abs/2111.00678v1
- Date: Mon, 1 Nov 2021 03:37:02 GMT
- Title: Latent Structures Mining with Contrastive Modality Fusion for Multimedia
Recommendation
- Authors: Jinghao Zhang, Yanqiao Zhu, Qiang Liu, Mengqi Zhang, Shu Wu, Liang
Wang
- Abstract summary: We argue that the latent semantic item-item structures underlying multimodal contents could be beneficial for learning better item representations.
We devise a novel modality-aware structure learning module, which learns item-item relationships for each modality.
- Score: 22.701371886522494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed growing interest in multimedia recommendation,
which aims to predict whether a user will interact with an item with multimodal
contents. Previous studies focus on modeling user-item interactions with
multimodal features included as side information. However, this scheme is not
well-designed for multimedia recommendation. Firstly, only collaborative
item-item relationships are implicitly modeled through high-order
item-user-item co-occurrences. We argue that the latent semantic item-item
structures underlying these multimodal contents could be beneficial for
learning better item representations and assist the recommender models to
comprehensively discover candidate items. Secondly, previous studies disregard
fine-grained multimodal fusion. Although access to multiple modalities might
allow us to capture rich information, we argue that the simple coarse-grained
fusion by linear combination or concatenation in previous work is insufficient
to fully understand content information and item relationships. To this end,
we propose a latent structure MIning with ContRastive mOdality fusion method
(MICRO for brevity). To be specific, we
devise a novel modality-aware structure learning module, which learns item-item
relationships for each modality. Based on the learned modality-aware latent
item relationships, we perform graph convolutions that explicitly inject item
affinities to modality-aware item representations. Then, we design a novel
contrastive method to fuse multimodal features. These enriched item
representations can be plugged into existing collaborative filtering methods to
make more accurate recommendations. Extensive experiments on real-world
datasets demonstrate the superiority of our method over state-of-the-art
baselines.
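To make the abstract's pipeline concrete, below is a minimal, hypothetical sketch of the three steps it describes: building a kNN item-item graph per modality, running graph convolutions over each latent graph, and contrastively aligning modality-aware representations with a fused one. This is not the authors' released implementation; the tensor names, the cosine-similarity kNN construction, the mean fusion, and all hyperparameters (k, tau, number of layers) are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code), assuming pre-extracted
# visual/textual item features of shape [num_items, dim].
import torch
import torch.nn.functional as F

def modality_item_graph(features, k=10):
    """Build a row-normalized kNN item-item graph from one modality's features."""
    x = F.normalize(features, dim=-1)
    sim = x @ x.t()                              # cosine similarity between items
    topk, idx = sim.topk(k, dim=-1)              # keep the k most similar items per row
    adj = torch.zeros_like(sim).scatter_(-1, idx, topk.clamp(min=0))
    deg = adj.sum(-1, keepdim=True).clamp(min=1e-8)
    return adj / deg

def graph_convolve(adj, item_emb, layers=2):
    """Inject item affinities by propagating embeddings over the latent graph."""
    h = item_emb
    for _ in range(layers):
        h = adj @ h
    return h

def contrastive_fusion_loss(mod_embs, fused, tau=0.5):
    """InfoNCE-style loss pulling each modality-aware view toward the fused view."""
    fused = F.normalize(fused, dim=-1)
    loss = 0.0
    for z in mod_embs:
        z = F.normalize(z, dim=-1)
        logits = z @ fused.t() / tau             # positives sit on the diagonal
        labels = torch.arange(z.size(0))
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(mod_embs)

# Toy usage with random features standing in for CNN / text-encoder outputs.
num_items, dim = 100, 64
visual, textual = torch.randn(num_items, dim), torch.randn(num_items, dim)
item_id_emb = torch.randn(num_items, dim, requires_grad=True)

mod_embs = [graph_convolve(modality_item_graph(f), item_id_emb)
            for f in (visual, textual)]
fused = torch.stack(mod_embs).mean(0)            # simple fusion; the paper learns weights
loss = contrastive_fusion_loss(mod_embs, fused)
```

In this sketch the fused representation would be added to the ID embeddings of an existing collaborative filtering backbone, matching the abstract's claim that the enriched item representations are plug-in components.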
Related papers
- Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding [51.96911650437978]
Multi-modal fusion has played a vital role in multi-modal scene understanding.
Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion.
We propose a relational Part-Whole Fusion (PWRF) framework for multi-modal scene understanding.
arXiv Detail & Related papers (2024-10-19T02:27:30Z)
- StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers.
We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding.
Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
arXiv Detail & Related papers (2024-08-02T15:41:16Z)
- Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach.
Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens.
We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z)
- AlignRec: Aligning and Training in Multimodal Recommendations [29.995007279325947]
Multimodal recommendations can leverage rich contexts beyond interactions.
Existing methods mainly regard multimodal information as auxiliary, using it to help learn ID features.
There exist semantic gaps between multimodal content features and ID-based features, so directly using multimodal information as auxiliary would lead to misalignment in the representations of users and items.
arXiv Detail & Related papers (2024-03-19T02:49:32Z)
- MM-GEF: Multi-modal representation meet collaborative filtering [43.88159639990081]
We propose a graph-based item structure enhancement method MM-GEF: Multi-Modal recommendation with Graph Early-Fusion.
MM-GEF learns refined item representations by injecting structural information obtained from both multi-modal and collaborative signals.
arXiv Detail & Related papers (2023-08-14T15:47:36Z)
- Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction [13.454953507205278]
Multi-Modal Relation Extraction aims at identifying the relation between two entities in texts that contain visual clues.
We propose a novel MMRE framework to better capture the deeper correlations of text, entity pair, and image/objects.
Our approach achieves excellent performance compared to strong competitors, even in the few-shot situation.
arXiv Detail & Related papers (2023-06-19T15:31:34Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Mining Latent Structures for Multimedia Recommendation [46.70109406399858]
We propose a LATent sTructure mining method for multImodal reCommEndation, which we term LATTICE for brevity.
We learn item-item structures for each modality and aggregate multiple modalities to obtain latent item graphs.
Based on the learned latent graphs, we perform graph convolutions to explicitly inject high-order item affinities into item representations (a minimal sketch of this aggregation step follows the list).
arXiv Detail & Related papers (2021-04-19T03:50:24Z)
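As a complement to the earlier sketch, the following equally hypothetical snippet illustrates the latent-graph aggregation step that LATTICE describes: per-modality item graphs are combined with learned importance weights into a single latent item graph before propagation. The softmax weighting, names, and shapes are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of aggregating modality-specific item graphs into one
# latent item graph, then propagating item embeddings over it.
import torch

def aggregate_modality_graphs(adjs, logits):
    """Fuse per-modality adjacency matrices with softmax-normalized weights."""
    weights = torch.softmax(logits, dim=0)              # one scalar weight per modality
    return sum(w * a for w, a in zip(weights, adjs))

# adjs could come from a kNN construction such as modality_item_graph() above.
adjs = [torch.rand(100, 100), torch.rand(100, 100)]     # e.g. visual / textual item graphs
logits = torch.zeros(len(adjs), requires_grad=True)     # learnable modality importances
latent_adj = aggregate_modality_graphs(adjs, logits)
item_emb = torch.randn(100, 64)
high_order = latent_adj @ (latent_adj @ item_emb)       # two rounds of propagation
```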
This list is automatically generated from the titles and abstracts of the papers on this site.