Multimodal Recommendation Dialog with Subjective Preference: A New
Challenge and Benchmark
- URL: http://arxiv.org/abs/2305.18212v1
- Date: Fri, 26 May 2023 08:43:46 GMT
- Title: Multimodal Recommendation Dialog with Subjective Preference: A New
Challenge and Benchmark
- Authors: Yuxing Long, Binyuan Hui, Caixia Yuan, Fei Huang, Yongbin Li, Xiaojie
Wang
- Abstract summary: This paper introduces a new dataset, SURE (Multimodal Recommendation Dialog with SUbjective PREference).
The data is built in two phases with human annotations to ensure quality and diversity.
SURE is well-annotated with subjective preferences and recommendation acts proposed by sales experts.
- Score: 38.613625892808706
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing multimodal task-oriented dialog data fails to capture the
diverse expressions of user subjective preferences and recommendation acts found in
real-life shopping scenarios. This paper introduces a new dataset, SURE
(Multimodal Recommendation Dialog with SUbjective PREference), which contains
12K shopping dialogs in complex store scenes. The data is built in two phases
with human annotations to ensure quality and diversity. SURE is well-annotated
with subjective preferences and recommendation acts proposed by sales experts.
A comprehensive analysis is given to reveal the distinguishing features of
SURE. Three benchmark tasks are then proposed on the data to evaluate the
capability of multimodal recommendation agents. Based on SURE, we propose a
baseline model, powered by a state-of-the-art multimodal model, for these
tasks.
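As a rough illustration only, a SURE-style annotated dialog turn could be organized along the lines below; the field names and label values (subjective_preference, recommendation_act, etc.) are hypothetical stand-ins, not the dataset's actual schema.

```python
# Hypothetical sketch of a SURE-style annotated dialog; field names and label
# values are illustrative, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:
    speaker: str                                          # "user" or "assistant"
    utterance: str                                        # natural-language text of the turn
    image_ids: List[str] = field(default_factory=list)    # items visible in the store scene
    subjective_preference: Optional[str] = None           # e.g. "prefers a casual style"
    recommendation_act: Optional[str] = None              # e.g. "recommend_by_style"

@dataclass
class Dialog:
    dialog_id: str
    scene_id: str                                         # complex store scene grounding the dialog
    turns: List[Turn] = field(default_factory=list)

example = Dialog(
    dialog_id="sure_000001",
    scene_id="store_scene_042",
    turns=[
        Turn("user", "I want something that looks more casual.", ["item_17"],
             subjective_preference="prefers casual style"),
        Turn("assistant", "How about this one? It has a relaxed fit.", ["item_23"],
             recommendation_act="recommend_by_style"),
    ],
)
print(example.turns[1].recommendation_act)
```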
Related papers
- IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification [60.38841251693781]
We propose a novel framework for robust multi-modal object re-identification (ReID).
Our framework uses Modal Prefixes and InverseNet to integrate multi-modal information with semantic guidance from inverted text.
Experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2025-03-13T13:00:31Z)
- Joint Modeling in Recommendations: A Survey [46.000357352884926]
Joint modeling approaches are central to overcoming limitations by integrating diverse tasks, scenarios, modalities, and behaviors in the recommendation process.
We define the scope of joint modeling through four distinct dimensions: multi-task, multi-scenario, multi-modal, and multi-behavior modeling.
We highlight several promising avenues for future exploration in joint modeling for recommendations and provide a concise conclusion to our findings.
arXiv Detail & Related papers (2025-02-28T16:14:00Z)
- Scenario-Wise Rec: A Multi-Scenario Recommendation Benchmark [54.93461228053298]
We introduce our benchmark, Scenario-Wise Rec, which comprises 6 public datasets and 12 benchmark models, along with a training and evaluation pipeline.
We aim for this benchmark to offer researchers valuable insights from prior work, enabling the development of novel models.
arXiv Detail & Related papers (2024-12-23T08:15:34Z)
- Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation [9.506245109666907]
Multi-faceted features characterizing products and services may influence each customer on online selling platforms differently.
The common multimodal recommendation pipeline involves (i) extracting multimodal features, (ii) refining their high-level representations to suit the recommendation task, and (iii) predicting the user-item score.
This paper presents the first attempt to offer large-scale benchmarking for multimodal recommender systems, with a specific focus on multimodal extractors.
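For reference, here is a minimal sketch of that generic pipeline, assuming pre-extracted visual and textual item features, simple late fusion, and a dot-product scorer; the shapes and module names are illustrative, not the Ducho/Elliot implementation.

```python
# Minimal sketch of the generic multimodal recommendation pipeline:
# (i) extract per-item multimodal features, (ii) refine/fuse them for the
# recommendation task, (iii) predict a user-item score.
# Shapes and the fusion rule are illustrative, not Ducho/Elliot APIs.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d_vis, d_txt, d = 100, 500, 512, 384, 64

# (i) multimodal features as produced by frozen extractors (random stand-ins here)
visual_feats = rng.normal(size=(n_items, d_vis))
text_feats = rng.normal(size=(n_items, d_txt))

# (ii) refine the high-level representations into a shared space (simple late fusion)
W_vis = rng.normal(scale=0.02, size=(d_vis, d))
W_txt = rng.normal(scale=0.02, size=(d_txt, d))
item_emb = visual_feats @ W_vis + text_feats @ W_txt

# user embeddings would normally be learned from interaction data
user_emb = rng.normal(scale=0.02, size=(n_users, d))

# (iii) predict user-item scores and rank items per user
scores = user_emb @ item_emb.T                 # (n_users, n_items)
top10 = np.argsort(-scores, axis=1)[:, :10]    # top-10 recommendations per user
print(top10[0])
```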
arXiv Detail & Related papers (2024-09-24T08:29:10Z)
- Towards Bridging the Cross-modal Semantic Gap for Multi-modal Recommendation [12.306686291299146]
Multi-modal recommendation greatly enhances the performance of recommender systems.
Most existing multi-modal recommendation models exploit multimedia information propagation processes to enrich item representations.
We propose a novel framework to bridge the semantic gap between modalities and extract fine-grained multi-view semantic information.
arXiv Detail & Related papers (2024-07-07T15:56:03Z)
- BiVRec: Bidirectional View-based Multimodal Sequential Recommendation [55.87443627659778]
We propose an innovative framework, BiVRec, that jointly trains the recommendation tasks in both ID and multimodal views.
BiVRec achieves state-of-the-art performance on five datasets and showcases various practical advantages.
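A hedged sketch of what joint training over an ID view and a multimodal view of the same task can look like; the GRU encoder, the shared next-item loss, and all module names are illustrative, not BiVRec's actual architecture.

```python
# Hedged sketch of jointly training an ID view and a multimodal view of the
# same sequential-recommendation task; illustrative only, not BiVRec itself.
import torch
import torch.nn as nn

class TwoViewRecommender(nn.Module):
    def __init__(self, n_items: int, d_mm: int, d: int = 64):
        super().__init__()
        self.id_emb = nn.Embedding(n_items, d)   # ID view: learned item embeddings
        self.mm_proj = nn.Linear(d_mm, d)        # multimodal view: projected item features
        self.user_enc = nn.GRU(d, d, batch_first=True)

    def forward(self, seq_ids, seq_mm):
        # encode the same interaction sequence under both views
        _, u_id = self.user_enc(self.id_emb(seq_ids))
        _, u_mm = self.user_enc(self.mm_proj(seq_mm))
        return u_id.squeeze(0), u_mm.squeeze(0)

model = TwoViewRecommender(n_items=1000, d_mm=128)
seq_ids = torch.randint(0, 1000, (8, 20))        # batch of 8 sequences, length 20
seq_mm = torch.randn(8, 20, 128)                 # multimodal feature per item in the sequence
target = torch.randint(0, 1000, (8,))            # next-item labels

u_id, u_mm = model(seq_ids, seq_mm)
logits_id = u_id @ model.id_emb.weight.T         # score next item under the ID view
logits_mm = u_mm @ model.id_emb.weight.T         # and under the multimodal view
loss = nn.functional.cross_entropy(logits_id, target) + \
       nn.functional.cross_entropy(logits_mm, target)
loss.backward()
```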
arXiv Detail & Related papers (2024-02-27T09:10:41Z)
- Ada-Retrieval: An Adaptive Multi-Round Retrieval Paradigm for Sequential Recommendations [50.03560306423678]
We propose Ada-Retrieval, an adaptive multi-round retrieval paradigm for recommender systems.
Ada-Retrieval iteratively refines user representations to better capture potential candidates in the full item space.
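A minimal sketch of the multi-round idea, where each round's retrieved items feed back into the user representation; the refinement rule below is a simple stand-in, not Ada-Retrieval's learned adapters.

```python
# Hedged sketch of a multi-round retrieval loop: each round retrieves a batch
# of candidates, then the user representation is refined with what was already
# retrieved so later rounds cover other regions of the item space.
# Purely illustrative; not Ada-Retrieval's actual components.
import numpy as np

rng = np.random.default_rng(0)
n_items, d, rounds, k_per_round = 10_000, 64, 3, 50

item_emb = rng.normal(size=(n_items, d))
user_rep = rng.normal(size=(d,))

retrieved = []
for r in range(rounds):
    scores = item_emb @ user_rep
    scores[retrieved] = -np.inf                  # never re-retrieve earlier candidates
    top_k = np.argsort(-scores)[:k_per_round]
    retrieved.extend(top_k.tolist())
    # refine the user representation with the centroid of this round's results
    user_rep = 0.7 * user_rep + 0.3 * item_emb[top_k].mean(axis=0)

print(len(retrieved), "candidates over", rounds, "rounds")
```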
arXiv Detail & Related papers (2024-01-12T15:26:40Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
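A rough sketch of the prompt-tuning pattern, in which a small trainable context generator emits prompt vectors that are prepended to a frozen encoder's input; the encoder below is a generic stand-in, not CLIP's actual interface.

```python
# Hedged sketch of parameter-efficient prompt tuning for dialog retrieval:
# a small trainable context generator maps dialog-context features to prompt
# vectors that are prepended to a frozen encoder's input sequence.
# The frozen encoder is a stand-in, not CLIP.
import torch
import torch.nn as nn

d_model, n_prompts = 512, 8

frozen_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
for p in frozen_encoder.parameters():
    p.requires_grad = False                        # pre-trained backbone stays fixed

context_generator = nn.Sequential(                 # the only trainable part
    nn.Linear(d_model, d_model),
    nn.ReLU(),
    nn.Linear(d_model, n_prompts * d_model),
)

dialog_context = torch.randn(4, d_model)           # pooled multi-modal context features
token_embs = torch.randn(4, 32, d_model)           # embedded dialog / candidate tokens

prompts = context_generator(dialog_context).view(4, n_prompts, d_model)
encoded = frozen_encoder(torch.cat([prompts, token_embs], dim=1))
query = encoded[:, 0]                              # use the first prompt position as the query
print(query.shape)                                 # torch.Size([4, 512])
```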
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- Application of frozen large-scale models to multimodal task-oriented dialogue [0.0]
We use the existing Large Language Models ENhanced to See (LENS) Framework to test the feasibility of multimodal task-oriented dialogues.
The LENS Framework has been proposed as a method to solve computer vision tasks without additional training and with fixed parameters of pre-trained models.
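Broadly, this style of approach composes the textual outputs of frozen vision modules into a prompt for a frozen language model; the functions in the sketch below are placeholders, not the LENS Framework's actual components.

```python
# Hedged sketch of the "frozen models" idea: frozen vision modules describe the
# image in text, and a frozen language model reasons over that text plus the
# dialog. Every function here is a placeholder, not a LENS component.
from typing import List

def frozen_tagger(image_path: str) -> List[str]:
    # stand-in for a frozen image-tagging model
    return ["sneaker", "white", "low-top"]

def frozen_captioner(image_path: str) -> str:
    # stand-in for a frozen captioning model
    return "A pair of white low-top sneakers on a store shelf."

def frozen_llm(prompt: str) -> str:
    # stand-in for a frozen large language model
    return "These white low-top sneakers match the casual style you asked for."

def answer(image_path: str, dialog_history: str, user_turn: str) -> str:
    # no module is fine-tuned; only the prompt composition changes per task
    prompt = (
        f"Image tags: {', '.join(frozen_tagger(image_path))}\n"
        f"Image caption: {frozen_captioner(image_path)}\n"
        f"Dialog so far: {dialog_history}\n"
        f"User: {user_turn}\nAssistant:"
    )
    return frozen_llm(prompt)

print(answer("scene.jpg", "User asked for casual shoes.", "Which of these fits me best?"))
```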
arXiv Detail & Related papers (2023-10-02T01:42:28Z)
- Large Language Models as Zero-Shot Conversational Recommenders [52.57230221644014]
We present empirical studies on conversational recommendation tasks using representative large language models in a zero-shot setting.
We construct a new dataset of recommendation-related conversations by scraping a popular discussion website.
We observe that even without fine-tuning, large language models can outperform existing fine-tuned conversational recommendation models.
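In a zero-shot setting, the conversation is simply serialized into a prompt and the model is asked to recommend; a minimal sketch follows, where call_llm is a placeholder rather than the paper's exact protocol.

```python
# Minimal sketch of zero-shot conversational recommendation: the dialog is
# serialized into a prompt and a frozen LLM is asked to name items.
# call_llm is a placeholder, not a real API or the paper's exact setup.
from typing import List

def call_llm(prompt: str) -> str:
    # placeholder for any chat/completions endpoint or local model
    return "1. The Shawshank Redemption\n2. The Green Mile\n3. Dead Poets Society"

def recommend(dialog: List[str], n: int = 3) -> List[str]:
    history = "\n".join(dialog)
    prompt = (
        "You are a recommender. Based on the conversation below, "
        f"suggest {n} items the user would likely enjoy, one per line.\n\n{history}"
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

print(recommend(["User: I loved Forrest Gump, something similarly moving?",
                 "Assistant: Do you prefer dramas or comedies?",
                 "User: Dramas."]))
```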
arXiv Detail & Related papers (2023-08-19T15:29:45Z)
- Read, Look or Listen? What's Needed for Solving a Multimodal Dataset [7.0430001782867]
We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it.
We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality.
We analyze MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, but also with auditory speaker identification.
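The two-step idea, i.e. label a small seed by hand with the modality each instance needs, then train a lightweight classifier to label the rest, could be sketched as follows; the features and labels are hypothetical stand-ins.

```python
# Hedged sketch of the two-step analysis: (1) humans annotate a small seed of
# instances with the modality required to answer them, (2) a lightweight
# classifier trained on that seed labels the rest of the dataset.
# Features and labels are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
MODALITIES = ["text", "image", "audio"]

# step 1: small human-annotated seed (features could be question/answer embeddings)
seed_X = rng.normal(size=(200, 32))
seed_y = rng.integers(0, len(MODALITIES), size=200)

clf = LogisticRegression(max_iter=1000).fit(seed_X, seed_y)

# step 2: map every remaining instance to its (predicted) required modality
rest_X = rng.normal(size=(5000, 32))
pred = clf.predict(rest_X)
for i, m in enumerate(MODALITIES):
    print(m, f"{(pred == i).mean():.1%}")
```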
arXiv Detail & Related papers (2023-07-06T08:02:45Z)
- DIONYSUS: A Pre-trained Model for Low-Resource Dialogue Summarization [127.714919036388]
DIONYSUS is a pre-trained encoder-decoder model for summarizing dialogues in any new domain.
Our experiments show that DIONYSUS outperforms existing methods on six datasets.
arXiv Detail & Related papers (2022-12-20T06:21:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.