ID Embedding as Subtle Features of Content and Structure for Multimodal Recommendation
- URL: http://arxiv.org/abs/2311.05956v2
- Date: Wed, 22 May 2024 10:59:03 GMT
- Title: ID Embedding as Subtle Features of Content and Structure for Multimodal Recommendation
- Authors: Yuting Liu, Enneng Yang, Yizhou Dang, Guibing Guo, Qiang Liu, Yuliang Liang, Linying Jiang, Xingwei Wang
- Abstract summary: We propose a novel recommendation model by incorporating ID embeddings to enhance the salient features of both content and structure.
Experiments show that our method outperforms state-of-the-art multimodal recommendation methods and demonstrate the effectiveness of fine-grained ID embeddings.
- Score: 13.338363107777438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal recommendation aims to model user and item representations comprehensively with the involvement of multimedia content for effective recommendations. Existing research has shown that it is beneficial for recommendation performance to combine (user- and item-) ID embeddings with multimodal salient features, indicating the value of IDs. However, there is a lack of a thorough analysis of the ID embeddings in terms of feature semantics in the literature. In this paper, we revisit the value of ID embeddings for multimodal recommendation and conduct a thorough study regarding its semantics, which we recognize as subtle features of \emph{content} and \emph{structure}. Based on our findings, we propose a novel recommendation model by incorporating ID embeddings to enhance the salient features of both content and structure. Specifically, we put forward a hierarchical attention mechanism to incorporate ID embeddings in modality fusing, coupled with contrastive learning, to enhance content representations. Meanwhile, we propose a lightweight graph convolution network for each modality to amalgamate neighborhood and ID embeddings for improving structural representations. Finally, the content and structure representations are combined to form the ultimate item embedding for recommendation. Extensive experiments on three real-world datasets (Baby, Sports, and Clothing) demonstrate the superiority of our method over state-of-the-art multimodal recommendation methods and the effectiveness of fine-grained ID embeddings. Our code is available at https://anonymous.4open.science/r/IDSF-code/.
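The abstract outlines two architectural components: a hierarchical attention mechanism that injects ID embeddings during modality fusion, and a lightweight per-modality graph convolution that amalgamates neighborhood information with ID embeddings. The snippet below is a minimal, hypothetical PyTorch sketch of the second component plus a simple two-modality fusion; the layer structure, dimensions, and fusion weights are illustrative assumptions, not the authors' released implementation (see the linked repository for the actual code).

```python
import torch
import torch.nn as nn

class PerModalityLightGCN(nn.Module):
    """Hypothetical sketch: propagate one modality's content features over an
    item graph and mix in ID embeddings as fine-grained structural features."""

    def __init__(self, num_items, modal_dim, id_dim=64, num_layers=2):
        super().__init__()
        self.id_embedding = nn.Embedding(num_items, id_dim)  # item ID embeddings
        self.proj = nn.Linear(modal_dim, id_dim)             # align content to ID space
        self.num_layers = num_layers

    def forward(self, modal_feat, norm_adj):
        # modal_feat: (num_items, modal_dim) content features of one modality
        # norm_adj:   (num_items, num_items) normalized item-item adjacency matrix
        h = self.proj(modal_feat) + self.id_embedding.weight  # amalgamate ID + content
        outputs = [h]
        for _ in range(self.num_layers):
            h = norm_adj @ h                                   # neighborhood aggregation
            outputs.append(h)
        # parameter-free layer averaging, LightGCN-style
        return torch.stack(outputs, dim=0).mean(dim=0)


# usage sketch: fuse visual and textual structural representations into one item embedding
num_items = 100
gcn_v = PerModalityLightGCN(num_items, modal_dim=4096)  # visual features
gcn_t = PerModalityLightGCN(num_items, modal_dim=384)   # textual features
adj = torch.eye(num_items)                               # placeholder adjacency
item_emb = 0.5 * gcn_v(torch.randn(num_items, 4096), adj) \
         + 0.5 * gcn_t(torch.randn(num_items, 384), adj)
print(item_emb.shape)                                    # torch.Size([100, 64])
```

The sketch covers only the structural branch; in the paper's full model this is combined with a content branch (hierarchical attention plus contrastive learning) to form the final item embedding.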
Related papers
- Learning ID-free Item Representation with Token Crossing for Multimodal Recommendation [26.737971605928358]
We propose an ID-free MultimOdal TOken Representation scheme named MOTOR.
We first employ product quantization to discretize each item's multimodal features into discrete token IDs.
We then interpret the token embeddings corresponding to these token IDs as implicit item features.
The resulting representations can replace the original ID embeddings and transform the original multimodal recommender into an ID-free system (a minimal sketch of this quantization step appears after this list).
arXiv Detail & Related papers (2024-10-25T03:06:10Z) - Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$^2$).
GR$^2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training.
Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$^2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z) - AlignRec: Aligning and Training in Multimodal Recommendations [29.995007279325947]
Multimodal recommendations can leverage rich contexts beyond user-item interactions.
Existing methods mainly regard multimodal information as an auxiliary signal, using it to help learn ID features.
There exist semantic gaps between multimodal content features and ID-based features, so directly using multimodal information as an auxiliary leads to misaligned representations of users and items.
arXiv Detail & Related papers (2024-03-19T02:49:32Z) - Multi-Prompts Learning with Cross-Modal Alignment for Attribute-based Person Re-Identification [18.01407937934588]
We present a new framework called Multi-Prompts ReID (MP-ReID) based on prompt learning and language models.
MP-ReID learns to hallucinate diverse, informative, and promptable sentences for describing the query images.
Explicit prompts are obtained by ensembling generation models, such as ChatGPT and VQA models.
arXiv Detail & Related papers (2023-12-28T03:00:19Z) - Preserving Modality Structure Improves Multi-Modal Learning [64.10085674834252]
Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings without relying on human annotations.
However, these methods often struggle to generalize to out-of-domain data because they ignore the semantic structure present in modality-specific embeddings.
We propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space.
arXiv Detail & Related papers (2023-08-24T20:46:48Z) - Exploring Fine-Grained Representation and Recomposition for Cloth-Changing Person Re-Identification [78.52704557647438]
We propose a novel FIne-grained Representation and Recomposition (FIRe$^2$) framework to tackle both limitations without any auxiliary annotation or data.
Experiments demonstrate that FIRe$^2$ can achieve state-of-the-art performance on five widely-used cloth-changing person Re-ID benchmarks.
arXiv Detail & Related papers (2023-08-21T12:59:48Z) - MM-GEF: Multi-modal representation meet collaborative filtering [43.88159639990081]
We propose MM-GEF (Multi-Modal recommendation with Graph Early-Fusion), a graph-based item structure enhancement method.
MM-GEF learns refined item representations by injecting structural information obtained from both multi-modal and collaborative signals.
arXiv Detail & Related papers (2023-08-14T15:47:36Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight cross-modal module.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - Mining Latent Structures for Multimedia Recommendation [46.70109406399858]
We propose a LATent sTructure mining method for multImodal reCommEndation, which we term LATTICE for brevity.
We learn item-item structures for each modality and aggregate multiple modalities to obtain latent item graphs.
Based on the learned latent graphs, we perform graph convolutions to explicitly inject high-order item affinities into item representations.
arXiv Detail & Related papers (2021-04-19T03:50:24Z) - SupMMD: A Sentence Importance Model for Extractive Summarization using Maximum Mean Discrepancy [92.5683788430012]
SupMMD is a novel technique for generic and update summarization based on the maximum mean discrepancy from kernel two-sample testing.
We show the efficacy of SupMMD in both generic and update summarization tasks by meeting or exceeding the current state-of-the-art on the DUC-2004 and TAC-2009 datasets.
arXiv Detail & Related papers (2020-10-06T09:26:55Z)
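As referenced in the MOTOR entry above, product quantization is used there to turn continuous multimodal item features into discrete token IDs whose embeddings replace ID embeddings. The following is a minimal, hypothetical sketch of that discretization step; the codebook size, number of sub-vectors, and random codebooks are illustrative assumptions rather than MOTOR's actual implementation.

```python
import torch

def product_quantize(features, codebooks):
    """Hypothetical sketch of product quantization: split each item's feature
    vector into sub-vectors and replace each sub-vector with the index of its
    nearest codeword (a discrete token ID)."""
    # features:  (num_items, dim)
    # codebooks: list of (num_codewords, dim // num_subvectors) tensors
    sub_feats = features.chunk(len(codebooks), dim=-1)  # split into sub-vectors
    token_ids = []
    for sub, book in zip(sub_feats, codebooks):
        dists = torch.cdist(sub, book)                   # (num_items, num_codewords)
        token_ids.append(dists.argmin(dim=-1))           # nearest codeword index
    return torch.stack(token_ids, dim=-1)                # (num_items, num_subvectors)


# usage: 4 sub-vectors with 256 codewords each -> every item gets 4 discrete token
# IDs, whose embeddings can then serve as implicit item features in place of ID embeddings
feats = torch.randn(1000, 512)
books = [torch.randn(256, 128) for _ in range(4)]
print(product_quantize(feats, books).shape)              # torch.Size([1000, 4])
```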