VIP5: Towards Multimodal Foundation Models for Recommendation
- URL: http://arxiv.org/abs/2305.14302v2
- Date: Sat, 14 Oct 2023 18:09:31 GMT
- Title: VIP5: Towards Multimodal Foundation Models for Recommendation
- Authors: Shijie Geng, Juntao Tan, Shuchang Liu, Zuohui Fu, and Yongfeng Zhang
- Abstract summary: We propose the development of a multimodal foundation model (MFM) to unify various modalities and recommendation tasks.
To achieve this, we introduce multimodal personalized prompts to accommodate multiple modalities under a shared format.
We also propose a parameter-efficient training method for foundation models, which involves freezing the P5 backbone and fine-tuning lightweight adapters.
- Score: 47.32368265586631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computer Vision (CV), Natural Language Processing (NLP), and Recommender
Systems (RecSys) are three prominent AI applications that have traditionally
developed independently, resulting in disparate modeling and engineering
methodologies. This has impeded the ability of these fields to directly
benefit from each other's advancements. With the recent development of
foundation models, large language models have emerged as a potential
general-purpose interface for unifying different modalities and problem
formulations. In light of this, we propose the development of a multimodal
foundation model (MFM) considering visual, textual, and personalization
modalities under the P5 recommendation paradigm, thus named VIP5 (Visual P5),
to unify various modalities and recommendation tasks. This will enable the
processing of multiple modalities in a shared architecture for improved
recommendations. To achieve this, we introduce multimodal personalized prompts
to accommodate multiple modalities under a shared format. Additionally, we
propose a parameter-efficient training method for foundation models, which
involves freezing the P5 backbone and fine-tuning lightweight adapters,
resulting in improved recommendation performance and increased efficiency in
terms of training time and memory usage. Code and data of VIP5 are available at
https://github.com/jeykigung/VIP5.
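The parameter-efficient recipe above can be sketched as follows: every backbone parameter is frozen and only lightweight adapter parameters stay trainable. This is a minimal pure-Python illustration; the parameter names ("backbone", "adapter") are invented for the example and are not taken from the VIP5 codebase.

```python
# Sketch: freeze the backbone, keep only adapter parameters trainable.

def split_trainable(param_names):
    """Mark adapter parameters trainable; freeze everything else."""
    return {name: ("adapter" in name) for name in param_names}

# Hypothetical parameter tensors of an adapter-augmented transformer.
params = [
    "backbone.block0.attention.weight",
    "backbone.block0.ffn.weight",
    "backbone.block0.adapter.down.weight",
    "backbone.block0.adapter.up.weight",
    "head.lm_out.weight",
]

trainable = split_trainable(params)
frac = sum(trainable.values()) / len(trainable)
print(f"trainable parameters: {frac:.0%} of listed tensors")
```

In a real framework the same effect is achieved by toggling each tensor's gradient flag, so the optimizer only updates the small adapter matrices, which is what yields the reported savings in training time and memory.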
Related papers
- E5-V: Universal Embeddings with Multimodal Large Language Models [51.5978154046302]
We introduce a new framework, E5-V, designed to adapt MLLMs for achieving universal multimodal embeddings.
By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs.
E5-V achieves strong performance in multimodal embeddings even without fine-tuning.
arXiv Detail & Related papers (2024-07-17T14:04:12Z)
- Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond [87.1712108247199]
Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP).
We develop a generic personalization generative framework that can handle a wide range of personalized needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
arXiv Detail & Related papers (2024-03-15T20:21:31Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework [51.01581167257862]
UnifiedVisionGPT is a novel framework designed to consolidate and automate the integration of SOTA vision models.
This paper outlines the architecture and capabilities of UnifiedVisionGPT, demonstrating its potential to revolutionize the field of computer vision.
arXiv Detail & Related papers (2023-11-16T13:01:25Z)
- PILL: Plug Into LLM with Adapter Expert and Attention Gate [11.956931222769128]
We introduce a novel architecture called PILL: Plug Into LLM with adapter expert and attention gate.
We introduce two modules: first, a Mixture-of-Modality-Adapter-Expert that independently handles different modalities;
second, Modality-Attention-Gating, which adaptively controls the contribution of modality tokens to the overall representation.
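The gating idea can be sketched as a learned scalar gate that scales each modality token's contribution before it is merged with the text representation. This is a pure-Python toy in the spirit of the description above, not PILL's actual module, which operates on tensors inside the LLM.

```python
import math

def sigmoid(x):
    """Squash a gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gate_modality_tokens(text_vec, modality_vecs, gate_logits):
    """Blend modality vectors into the text vector, each scaled by its gate."""
    out = list(text_vec)
    for vec, logit in zip(modality_vecs, gate_logits):
        g = sigmoid(logit)  # learned per-modality gate in practice
        out = [o + g * v for o, v in zip(out, vec)]
    return out

merged = gate_modality_tokens(
    text_vec=[1.0, 0.0],
    modality_vecs=[[0.5, 0.5]],
    gate_logits=[0.0],  # sigmoid(0) = 0.5 -> half contribution
)
print(merged)  # [1.25, 0.25]
```

Because the gate is differentiable, the model can learn to suppress or amplify each modality's tokens depending on how useful they are for the current input.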
arXiv Detail & Related papers (2023-11-03T09:31:10Z)
- What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? [24.676820488258336]
Large Language Models (LLMs) have displayed exceptional multi-modal capabilities in following open-ended instructions that involve images.
These models rely on design choices such as network structures, training data, and training strategies.
This paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models.
arXiv Detail & Related papers (2023-07-05T17:44:28Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
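A quick back-of-the-envelope check makes the ">99% frozen" claim concrete: a single linear projection (vision width to LM embedding width) plus one prepended trainable token is a tiny fraction of a typical LLM. The sizes below are illustrative assumptions, not eP-ALM's actual dimensions.

```python
# Hypothetical sizes for a frozen LM plus one trainable projection.
lm_params = 1_300_000_000          # frozen language model (~1.3B, assumed)
vision_dim, embed_dim = 768, 2048  # assumed encoder / LM widths

projection = vision_dim * embed_dim + embed_dim  # weight + bias
soft_token = embed_dim                           # one trainable prepended token
trainable = projection + soft_token

frac = trainable / (lm_params + trainable)
print(f"trainable fraction: {frac:.4%}")  # well under 1%
```

Even with generous dimensions, the trainable share stays around a tenth of a percent, consistent with freezing more than 99% of total parameters.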
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5) [41.57432785137957]
We present a flexible and unified text-to-text paradigm called "Pretrain, Personalized Prompt, and Predict Paradigm" (P5) for recommendation.
All data such as user-item interactions, item metadata, and user reviews are converted to a common format -- natural language sequences.
P5 learns different tasks with the same language modeling objective during pretraining.
arXiv Detail & Related papers (2022-03-24T22:13:23Z)
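P5's core move, serializing heterogeneous recommendation data into natural-language sequences so one text-to-text model covers every task, can be illustrated with a small formatter. The prompt template and identifiers below are invented for this sketch; P5 defines its own prompt collections per task family.

```python
def interaction_to_prompt(user_id, item_history, target_item):
    """Turn a user-item interaction record into an input/target text pair."""
    source = (
        f"User_{user_id} has purchased items {', '.join(item_history)}. "
        f"Predict the next item the user will interact with."
    )
    target = target_item  # the model is trained to generate this string
    return source, target

src, tgt = interaction_to_prompt("7391", ["item_42", "item_13"], "item_88")
print(src)
print(tgt)
```

Once interactions, metadata, and reviews all share this text format, sequential recommendation, rating prediction, and explanation generation become instances of the same language modeling objective.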
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.