Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond
- URL: http://arxiv.org/abs/2403.10667v2
- Date: Wed, 27 Mar 2024 21:11:19 GMT
- Title: Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond
- Authors: Tianxin Wei, Bowen Jin, Ruirui Li, Hansi Zeng, Zhengyang Wang, Jianhui Sun, Qingyu Yin, Hanqing Lu, Suhang Wang, Jingrui He, Xianfeng Tang,
- Abstract summary: Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP)
We develop a generic and extensible personalization generative framework that can handle a wide range of personalized needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
- Score: 87.1712108247199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing a universal model that can effectively harness heterogeneous resources and respond to a wide range of personalized needs has been a longstanding community aspiration. Our daily choices, especially in domains like fashion and retail, are substantially shaped by multi-modal data, such as pictures and textual descriptions. These modalities not only offer intuitive guidance but also cater to personalized user preferences. However, the predominant personalization approaches mainly focus on the ID or text-based recommendation problem, failing to comprehend the information spanning various tasks or modalities. In this paper, our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP), which effectively leverages multi-modal data while eliminating the complexities associated with task- and modality-specific customization. We argue that the advancements in foundational generative modeling have provided the flexibility and effectiveness necessary to achieve the objective. In light of this, we develop a generic and extensible personalization generative framework that can handle a wide range of personalized needs including item recommendation, product search, preference prediction, explanation generation, and further user-guided image generation. Our methodology enhances the capabilities of foundational language models for personalized tasks by seamlessly ingesting interleaved cross-modal user history information, ensuring a more precise and customized experience for users. To train and evaluate the proposed multi-modal personalized tasks, we also introduce a novel and comprehensive benchmark covering a variety of user requirements. Our experiments on the real-world benchmark showcase the model's potential, outperforming competitive methods specialized for each task.
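The abstract's central mechanism is feeding interleaved cross-modal user history (item images and text) to a foundational vision-language model so that a single generative interface covers recommendation, search, preference prediction, explanation, and image generation. The exact input format is not given on this page, so the following is only a minimal sketch under assumptions: a hypothetical prompt layout in which each history item contributes an `<image>` placeholder plus a text description, and the task is selected by a natural-language instruction. Names such as `HistoryItem` and `build_unimp_prompt` are illustrative, not from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HistoryItem:
    """One entry of a user's multi-modal interaction history (hypothetical schema)."""
    image_path: str   # pixels would be encoded separately by the model's vision tower
    title: str
    review: str = ""

def build_unimp_prompt(history: List[HistoryItem], task_instruction: str) -> str:
    """Serialize interleaved image/text history into one prompt string.

    `<image>` is a stand-in placeholder token; a real vision-language model
    would substitute projected visual embeddings at that position.
    """
    parts = ["[USER HISTORY]"]
    for i, item in enumerate(history, start=1):
        parts.append(f"Item {i}: <image> Title: {item.title}")
        if item.review:
            parts.append(f"User comment: {item.review}")
    parts.append("[TASK] " + task_instruction)
    return "\n".join(parts)

if __name__ == "__main__":
    history = [
        HistoryItem("img/denim_jacket.jpg", "Oversized denim jacket", "Loved the fit"),
        HistoryItem("img/white_sneakers.jpg", "White leather sneakers"),
    ]
    # The same serialized history can precede different task instructions,
    # which is the point of a unified generative interface.
    print(build_unimp_prompt(history, "Recommend the next item this user is likely to buy."))
    print(build_unimp_prompt(history, "Explain why the user might like a canvas tote bag."))
```

The sketch only illustrates the interleaving idea; it does not reproduce the paper's tokenizer, training objective, or benchmark.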
Related papers
- Personalized Image Generation with Large Multimodal Models [47.289887243367055]
We propose a Personalized Image Generation Framework named Pigeon to capture users' visual preferences and needs from noisy user history and multimodal instructions.
We apply Pigeon to personalized sticker and movie poster generation, where extensive quantitative results and human evaluation highlight its superiority over various generative baselines.
arXiv Detail & Related papers (2024-10-18T04:20:46Z)
- PEFT-U: Parameter-Efficient Fine-Tuning for User Personalization [9.594958534074074]
We introduce the PEFT-U Benchmark: a new dataset for building and evaluating NLP models for user personalization.
We explore the challenge of efficiently personalizing LLMs to accommodate user-specific preferences in the context of diverse user-centered tasks.
arXiv Detail & Related papers (2024-07-25T14:36:18Z)
- PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation.
Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts.
We conduct extensive experiments and achieve state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks.
arXiv Detail & Related papers (2024-06-05T03:05:52Z)
- Generating Illustrated Instructions [41.613203340244155]
We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs.
We combine the power of large language models (LLMs) with strong text-to-image diffusion models to propose a simple approach called StackedDiffusion.
arXiv Detail & Related papers (2023-12-07T18:59:20Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- When Large Language Models Meet Personalization: Perspectives of Challenges and Opportunities [60.5609416496429]
The capabilities of large language models have improved dramatically.
Such a major leap forward in general AI capacity will change how personalization is conducted.
By leveraging large language models as a general-purpose interface, personalization systems may compile user requests into plans.
arXiv Detail & Related papers (2023-07-31T02:48:56Z)
- Fast Adaptation with Bradley-Terry Preference Models in Text-To-Image Classification and Generation [0.0]
We leverage the Bradley-Terry preference model to develop a fast adaptation method that efficiently fine-tunes the original model.
Extensive evidence of the capabilities of this framework is provided through experiments in different domains related to multimodal text and image understanding.
arXiv Detail & Related papers (2023-07-15T07:53:12Z)
- Personalized Multimodal Feedback Generation in Education [50.95346877192268]
The automatic evaluation of school assignments is an important application of AI in education.
We propose a novel Personalized Multimodal Feedback Generation Network (PMFGN) armed with a modality gate mechanism and a personalized bias mechanism.
Our model significantly outperforms several baselines by generating more accurate and diverse feedback.
arXiv Detail & Related papers (2020-10-31T05:26:49Z)