VIP5: Towards Multimodal Foundation Models for Recommendation
- URL: http://arxiv.org/abs/2305.14302v2
- Date: Sat, 14 Oct 2023 18:09:31 GMT
- Title: VIP5: Towards Multimodal Foundation Models for Recommendation
- Authors: Shijie Geng, Juntao Tan, Shuchang Liu, Zuohui Fu, and Yongfeng Zhang
- Abstract summary: We propose the development of a multimodal foundation model (MFM) to unify various modalities and recommendation tasks.
To achieve this, we introduce multimodal personalized prompts to accommodate multiple modalities under a shared format.
We also propose a parameter-efficient training method for foundation models, which involves freezing the P5 backbone and fine-tuning lightweight adapters.
- Score: 47.32368265586631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computer Vision (CV), Natural Language Processing (NLP), and Recommender
Systems (RecSys) are three prominent AI applications that have traditionally
developed independently, resulting in disparate modeling and engineering
methodologies. This has impeded the ability of these fields to directly
benefit from each other's advancements. With the recent development of
foundation models, large language models have emerged as a potential
general-purpose interface for unifying different modalities and problem
formulations. In light of this, we propose the development of a multimodal
foundation model (MFM) considering visual, textual, and personalization
modalities under the P5 recommendation paradigm, thus named VIP5 (Visual P5),
to unify various modalities and recommendation tasks. This will enable the
processing of multiple modalities in a shared architecture for improved
recommendations. To achieve this, we introduce multimodal personalized prompts
to accommodate multiple modalities under a shared format. Additionally, we
propose a parameter-efficient training method for foundation models, which
involves freezing the P5 backbone and fine-tuning lightweight adapters,
resulting in improved recommendation performance and increased efficiency in
terms of training time and memory usage. Code and data of VIP5 are available at
https://github.com/jeykigung/VIP5.
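The parameter-efficient recipe above can be sketched as follows: every backbone parameter is frozen and only lightweight adapter parameters stay trainable. This is a minimal pure-Python illustration; the parameter names ("backbone", "adapter") are invented for the example and are not taken from the VIP5 codebase.

```python
# Sketch: freeze the backbone, keep only adapter parameters trainable.

def split_trainable(param_names):
    """Mark adapter parameters trainable; freeze everything else."""
    return {name: ("adapter" in name) for name in param_names}

# Hypothetical parameter tensors of an adapter-augmented transformer.
params = [
    "backbone.block0.attention.weight",
    "backbone.block0.ffn.weight",
    "backbone.block0.adapter.down.weight",
    "backbone.block0.adapter.up.weight",
    "head.lm_out.weight",
]

trainable = split_trainable(params)
frac = sum(trainable.values()) / len(trainable)
print(f"trainable parameters: {frac:.0%} of listed tensors")
```

In a real framework the same effect is achieved by toggling each tensor's gradient flag, so the optimizer only updates the small adapter matrices, which is what yields the reported savings in training time and memory.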
Related papers
- E5-V: Universal Embeddings with Multimodal Large Language Models [51.5978154046302]
We introduce a new framework, E5-V, designed to adapt MLLMs for achieving universal multimodal embeddings.
By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs.
E5-V achieves strong performance in multimodal embeddings even without fine-tuning.
arXiv Detail & Related papers (2024-07-17T14:04:12Z)
- Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond [87.1712108247199]
Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP).
We develop a generic personalization generative framework that can handle a wide range of personalized needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
arXiv Detail & Related papers (2024-03-15T20:21:31Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework [51.01581167257862]
UnifiedVisionGPT is a novel framework designed to consolidate and automate the integration of SOTA vision models.
This paper outlines the architecture and capabilities of UnifiedVisionGPT, demonstrating its potential to revolutionize the field of computer vision.
arXiv Detail & Related papers (2023-11-16T13:01:25Z)
- PILL: Plug Into LLM with Adapter Expert and Attention Gate [11.956931222769128]
We introduce a novel architecture called PILL: Plug Into LLM with adapter expert and attention gate.
We introduce two modules: first, a Mixture-of-Modality-Adapter-Expert that independently handles different modalities;
second, Modality-Attention-Gating, which adaptively controls the contribution of modality tokens to the overall representation.
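The gating idea can be sketched as a learned scalar gate that scales each modality token's contribution before it is merged with the text representation. This is a pure-Python toy in the spirit of the description above, not PILL's actual module, which operates on tensors inside the LLM.

```python
import math

def sigmoid(x):
    """Squash a gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gate_modality_tokens(text_vec, modality_vecs, gate_logits):
    """Blend modality vectors into the text vector, each scaled by its gate."""
    out = list(text_vec)
    for vec, logit in zip(modality_vecs, gate_logits):
        g = sigmoid(logit)  # learned per-modality gate in practice
        out = [o + g * v for o, v in zip(out, vec)]
    return out

merged = gate_modality_tokens(
    text_vec=[1.0, 0.0],
    modality_vecs=[[0.5, 0.5]],
    gate_logits=[0.0],  # sigmoid(0) = 0.5 -> half contribution
)
print(merged)  # [1.25, 0.25]
```

Because the gate is differentiable, the model can learn to suppress or amplify each modality's tokens depending on how useful they are for the current input.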
arXiv Detail & Related papers (2023-11-03T09:31:10Z)
- What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? [24.676820488258336]
Large Language Models (LLMs) have displayed exceptional multi-modal capabilities in following open-ended instructions that involve images.
These models rely on design choices such as network structures, training data, and training strategies.
This paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models.
arXiv Detail & Related papers (2023-07-05T17:44:28Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
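A quick back-of-the-envelope check makes the ">99% frozen" claim concrete: a single linear projection (vision width to LM embedding width) plus one prepended trainable token is a tiny fraction of a typical LLM. The sizes below are illustrative assumptions, not eP-ALM's actual dimensions.

```python
# Hypothetical sizes for a frozen LM plus one trainable projection.
lm_params = 1_300_000_000          # frozen language model (~1.3B, assumed)
vision_dim, embed_dim = 768, 2048  # assumed encoder / LM widths

projection = vision_dim * embed_dim + embed_dim  # weight + bias
soft_token = embed_dim                           # one trainable prepended token
trainable = projection + soft_token

frac = trainable / (lm_params + trainable)
print(f"trainable fraction: {frac:.4%}")  # well under 1%
```

Even with generous dimensions, the trainable share stays around a tenth of a percent, consistent with freezing more than 99% of total parameters.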
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5) [41.57432785137957]
We present a flexible and unified text-to-text paradigm called "Pretrain, Personalized Prompt, and Predict Paradigm" (P5) for recommendation.
All data such as user-item interactions, item metadata, and user reviews are converted to a common format -- natural language sequences.
P5 learns different tasks with the same language modeling objective during pretraining.
arXiv Detail & Related papers (2022-03-24T22:13:23Z)
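P5's core move, serializing heterogeneous recommendation data into natural-language sequences so one text-to-text model covers every task, can be illustrated with a small formatter. The prompt template and identifiers below are invented for this sketch; P5 defines its own prompt collections per task family.

```python
def interaction_to_prompt(user_id, item_history, target_item):
    """Turn a user-item interaction record into an input/target text pair."""
    source = (
        f"User_{user_id} has purchased items {', '.join(item_history)}. "
        f"Predict the next item the user will interact with."
    )
    target = target_item  # the model is trained to generate this string
    return source, target

src, tgt = interaction_to_prompt("7391", ["item_42", "item_13"], "item_88")
print(src)
print(tgt)
```

Once interactions, metadata, and reviews all share this text format, sequential recommendation, rating prediction, and explanation generation become instances of the same language modeling objective.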
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.