Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models
- URL: http://arxiv.org/abs/2402.08670v1
- Date: Tue, 13 Feb 2024 18:51:18 GMT
- Title: Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models
- Authors: Yuqing Liu, Yu Wang, Lichao Sun, Philip S. Yu
- Abstract summary: We propose a novel reasoning scheme named Rec-GPT4V: Visual-Summary Thought (VST)
We utilize user history as in-context user preferences to address the first challenge.
Next, we prompt LVLMs to generate item image summaries and utilize image comprehension in natural language space combined with item titles to query the user preferences over candidate items.
- Score: 48.129934341928355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The development of large vision-language models (LVLMs) offers the potential
to address challenges faced by traditional multimodal recommendations thanks to
their proficient understanding of static images and textual dynamics. However,
the application of LVLMs in this field is still limited due to the following
complexities: First, LVLMs lack user preference knowledge as they are trained
from vast general datasets. Second, LVLMs suffer setbacks in addressing
multiple image dynamics in scenarios involving discrete, noisy, and redundant
image sequences. To overcome these issues, we propose the novel reasoning
scheme named Rec-GPT4V: Visual-Summary Thought (VST) of leveraging large
vision-language models for multimodal recommendation. We utilize user history
as in-context user preferences to address the first challenge. Next, we prompt
LVLMs to generate item image summaries and utilize image comprehension in
natural language space combined with item titles to query the user preferences
over candidate items. We conduct comprehensive experiments across four datasets
with three LVLMs: GPT4-V, LLaVa-7b, and LLaVa-13b. The numerical results
indicate the efficacy of VST.
Related papers
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs [88.28014831467503]
We introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset.
MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks.
We demonstrate that ffne-tuning open-source LVLMs on MMDU-45k signiffcantly address this gap, generating longer and more accurate conversations.
arXiv Detail & Related papers (2024-06-17T17:59:47Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short to comprehend context involving multiple images.
We propose a two phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z) - Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs)
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z) - MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z) - Video-LLaVA: Learning United Visual Representation by Alignment Before
Projection [28.39885771124003]
We introduce Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other.
Video-LLaVA superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits.
Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos.
arXiv Detail & Related papers (2023-11-16T10:59:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.