VIP5: Towards Multimodal Foundation Models for Recommendation
- URL: http://arxiv.org/abs/2305.14302v2
- Date: Sat, 14 Oct 2023 18:09:31 GMT
- Title: VIP5: Towards Multimodal Foundation Models for Recommendation
- Authors: Shijie Geng and Juntao Tan and Shuchang Liu and Zuohui Fu and Yongfeng
Zhang
- Abstract summary: We propose the development of a multimodal foundation model (MFM) to unify various modalities and recommendation tasks.
To achieve this, we introduce multimodal personalized prompts to accommodate multiple modalities under a shared format.
We also propose a parameter-efficient training method for foundation models, which involves freezing the P5 backbone and fine-tuning lightweight adapters.
- Score: 47.32368265586631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computer Vision (CV), Natural Language Processing (NLP), and Recommender
Systems (RecSys) are three prominent AI applications that have traditionally
developed independently, resulting in disparate modeling and engineering
methodologies. This has impeded the ability for these fields to directly
benefit from each other's advancements. With the recent development of
foundation models, large language models have emerged as a potential
general-purpose interface for unifying different modalities and problem
formulations. In light of this, we propose the development of a multimodal
foundation model (MFM) considering visual, textual, and personalization
modalities under the P5 recommendation paradigm, thus named VIP5 (Visual P5),
to unify various modalities and recommendation tasks. This will enable the
processing of multiple modalities in a shared architecture for improved
recommendations. To achieve this, we introduce multimodal personalized prompts
to accommodate multiple modalities under a shared format. Additionally, we
propose a parameter-efficient training method for foundation models, which
involves freezing the P5 backbone and fine-tuning lightweight adapters,
resulting in improved recommendation performance and increased efficiency in
terms of training time and memory usage. Code and data of VIP5 are available at
https://github.com/jeykigung/VIP5.
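To make the parameter-efficient training idea concrete, here is a minimal sketch, assuming a T5-style P5 backbone from Hugging Face Transformers: the backbone is frozen and small bottleneck adapters, attached via forward hooks, are the only trainable parameters. The Adapter class, bottleneck size, hook wiring, and the toy multimodal prompt are illustrative assumptions, not the VIP5 implementation; see the repository above for the actual code.
```python
# Minimal sketch, not the authors' code: freeze a T5/P5-style backbone and
# train only lightweight bottleneck adapters attached to each encoder block.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen backbone's representation.
        return hidden_states + self.up(self.act(self.down(hidden_states)))


tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# 1) Freeze every backbone parameter.
for param in model.parameters():
    param.requires_grad = False

# 2) Attach one trainable adapter per encoder block via a forward hook.
adapters = nn.ModuleList([Adapter(model.config.d_model) for _ in model.encoder.block])


def make_hook(adapter: Adapter):
    def hook(module, inputs, output):
        # T5 blocks return a tuple; the adapter transforms the hidden states.
        return (adapter(output[0]),) + output[1:]
    return hook


for block, adapter in zip(model.encoder.block, adapters):
    block.register_forward_hook(make_hook(adapter))

# 3) Only the adapters are optimized.
optimizer = torch.optim.AdamW(adapters.parameters(), lr=1e-4)
print(f"trainable params: {sum(p.numel() for p in adapters.parameters()):,}")

# A toy personalized prompt; in the real model, item images would be injected
# as visual token embeddings rather than the <img> placeholder used here.
prompt = ("user_42 has interacted with item_17, item_903. <img> "
          "Given the item image and history, will user_42 enjoy item_55?")
inputs = tokenizer(prompt, return_tensors="pt")
```
Because only the adapters receive gradients, optimizer state and backward-pass memory scale with the adapter parameters rather than the full backbone, which is where the reported savings in training time and memory come from.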
Related papers
- Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL [70.1326027641056]
Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks.
We propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions.
We present a two-stage training pipeline, including supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-21T12:18:15Z)
- VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs).
Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks.
For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0.
InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet.
We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer [40.32254040909614]
We propose ACE, an All-round Creator and Editor, for visual generation tasks.
We first introduce a unified condition format termed Long-context Condition Unit (LCU).
We then propose a novel Transformer-based diffusion model that uses LCU as input, aiming for joint training across various generation and editing tasks.
arXiv Detail & Related papers (2024-09-30T17:56:27Z)
- E5-V: Universal Embeddings with Multimodal Large Language Models [51.5978154046302]
We introduce a new framework, E5-V, designed to adapt MLLMs for achieving universal multimodal embeddings.
By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs.
E5-V achieves strong performance in multimodal embeddings even without fine-tuning.
arXiv Detail & Related papers (2024-07-17T14:04:12Z)
- DiffMM: Multi-Modal Diffusion Model for Recommendation [19.43775593283657]
We propose a novel multi-modal graph diffusion model for recommendation called DiffMM.
Our framework integrates a modality-aware graph diffusion model with a cross-modal contrastive learning paradigm to improve modality-aware user representation learning.
arXiv Detail & Related papers (2024-06-17T17:35:54Z)
- Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond [87.1712108247199]
Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP).
We develop a generic and personalization generative framework, that can handle a wide range of personalized needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
arXiv Detail & Related papers (2024-03-15T20:21:31Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework [51.01581167257862]
UnifiedVisionGPT is a novel framework designed to consolidate and automate the integration of SOTA vision models.
This paper outlines the architecture and capabilities of UnifiedVisionGPT, demonstrating its potential to revolutionize the field of computer vision.
arXiv Detail & Related papers (2023-11-16T13:01:25Z)
- What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? [24.676820488258336]
Large Language Models (LLMs) have displayed exceptional multi-modal capabilities in following open-ended instructions given images.
These models rely on design choices such as network structures, training data, and training strategies.
This paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models.
arXiv Detail & Related papers (2023-07-05T17:44:28Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5) [41.57432785137957]
We present a flexible and unified text-to-text paradigm called "Pretrain, Personalized Prompt, and Predict Paradigm" (P5) for recommendation.
All data such as user-item interactions, item metadata, and user reviews are converted to a common format -- natural language sequences (a toy conversion is sketched after this entry).
P5 learns different tasks with the same language modeling objective during pretraining.
arXiv Detail & Related papers (2022-03-24T22:13:23Z)
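As a rough illustration of P5's text-to-text formulation described in the last entry, the sketch below renders a user-item interaction record as an input/target pair of natural-language sequences. The Interaction fields and the template wording are simplified assumptions, not P5's actual prompt templates.
```python
# Minimal sketch, assuming a simplified record layout: convert a user-item
# interaction into the natural-language input/target pair that a text-to-text
# recommendation model such as P5 could train on.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Interaction:
    user_id: str
    history: List[str]            # previously interacted item ids
    target_item: str
    review: Optional[str] = None  # optional user review text


def to_text_pair(x: Interaction) -> Tuple[str, str]:
    """Render one interaction as an (input prompt, target) pair of sequences."""
    prompt = (f"User {x.user_id} has purchased items {', '.join(x.history)}. "
              f"Predict the next item this user will purchase:")
    return prompt, x.target_item


example = Interaction(user_id="user_42",
                      history=["item_17", "item_903"],
                      target_item="item_55")
print(to_text_pair(example))
```
Because every task is expressed this way, a single language-modeling objective can cover rating, sequential recommendation, explanation, and review tasks alike.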