Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models
- URL: http://arxiv.org/abs/2402.08670v1
- Date: Tue, 13 Feb 2024 18:51:18 GMT
- Title: Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models
- Authors: Yuqing Liu, Yu Wang, Lichao Sun, Philip S. Yu
- Abstract summary: We propose a novel reasoning scheme named Rec-GPT4V: Visual-Summary Thought (VST)
We utilize user history as in-context user preferences to address the first challenge.
Next, we prompt LVLMs to generate item image summaries in natural language, then combine these summaries with item titles to query user preferences over the candidate items.
- Score: 48.129934341928355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The development of large vision-language models (LVLMs) offers the potential
to address challenges faced by traditional multimodal recommendations thanks to
their proficient understanding of static images and textual dynamics. However,
the application of LVLMs in this field is still limited due to the following
complexities: First, LVLMs lack user preference knowledge as they are trained
from vast general datasets. Second, LVLMs suffer setbacks in addressing
multiple image dynamics in scenarios involving discrete, noisy, and redundant
image sequences. To overcome these issues, we propose the novel reasoning
scheme named Rec-GPT4V: Visual-Summary Thought (VST) of leveraging large
vision-language models for multimodal recommendation. We utilize user history
as in-context user preferences to address the first challenge. Next, we prompt
LVLMs to generate item image summaries and utilize image comprehension in
natural language space combined with item titles to query the user preferences
over candidate items. We conduct comprehensive experiments across four datasets
with three LVLMs: GPT-4V, LLaVA-7b, and LLaVA-13b. The numerical results
indicate the efficacy of VST.
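The two-stage VST scheme described in the abstract can be illustrated with a short sketch. The code below is a minimal, hypothetical rendering under assumed interfaces: `vst_recommend` and the `query_lvlm(prompt, images=None)` wrapper are placeholder names, not the authors' implementation. It only shows how image summaries generated in stage one are combined with item titles and in-context user history in stage two.
```python
# Minimal sketch of the two-stage Visual-Summary Thought (VST) prompting scheme
# described above. `query_lvlm` is a hypothetical wrapper around whichever LVLM
# backend is used (e.g., GPT-4V or LLaVA); its signature is an assumption here,
# not part of the paper.
from typing import Callable, Sequence


def vst_recommend(
    query_lvlm: Callable[..., str],         # assumed call: query_lvlm(prompt, images=None) -> str
    history: Sequence[tuple[str, str]],     # (item title, image path) pairs the user interacted with
    candidates: Sequence[tuple[str, str]],  # (item title, image path) pairs to rank
) -> str:
    # Stage 1: have the LVLM summarize each candidate item image so that the
    # later preference reasoning happens entirely in natural-language space.
    summaries = [
        query_lvlm("Describe this product image in one sentence.", images=[image])
        for _, image in candidates
    ]

    # Stage 2: use the interaction history as in-context user preferences and
    # query preferences over candidates described by title + image summary.
    history_text = "\n".join(f"- {title}" for title, _ in history)
    candidate_text = "\n".join(
        f"{i + 1}. {title} -- {summary}"
        for i, ((title, _), summary) in enumerate(zip(candidates, summaries))
    )
    prompt = (
        "The user previously interacted with the following items:\n"
        f"{history_text}\n\n"
        "Candidate items (title -- image summary):\n"
        f"{candidate_text}\n\n"
        "Which candidate best matches the user's preferences? Answer with its number."
    )
    return query_lvlm(prompt)
```
Because the LVLM backend is abstracted behind the callable, the same sketch applies to any of the three models evaluated in the paper (GPT-4V, LLaVA-7b, LLaVA-13b).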
Related papers
- Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs [8.97780713904412]
This paper introduces ReVisiT, a simple yet effective decoding method that references vision tokens to guide the text generation process in Large Vision-Language Models (LVLMs). Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution space, and dynamically selecting the most relevant vision token at each decoding step through constrained divergence minimization. Experiments on three LVLM benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances visual grounding with minimal computational overhead.
arXiv Detail & Related papers (2025-06-11T08:46:55Z)
- ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models [15.907584884933414]
We introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context.
We propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities.
Experimental results demonstrate that ImageRef-VL not only outperforms proprietary models but also achieves an 88% performance improvement over state-of-the-art open-source VLMs.
arXiv Detail & Related papers (2025-01-20T13:43:45Z)
- Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions [13.16300262271362]
Current popular Large Vision-Language Models (LVLMs) suffer from Hallucinations on Object Attributes (HoOA).
This paper proposes a novel method to mitigate HoOA in LVLMs.
arXiv Detail & Related papers (2025-01-17T07:48:37Z)
- FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion [7.322448493179106]
Flow Text with Image Insertion task requires LVLMs to simultaneously possess outstanding abilities in image comprehension, instruction understanding, and long-text interpretation.
We introduce the Flow Text with Image Insertion Benchmark (FTII-Bench), which includes 318 high-quality Chinese image-text news articles and 307 high-quality English image-text news articles, covering 10 different news domains.
We evaluate 9 open-source and 2 closed-source LVLMs as well as 2 CLIP-based models.
arXiv Detail & Related papers (2024-10-16T13:38:31Z)
- SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information [26.049228685973667]
Large Vision-Language Models (LVLMs) have become pivotal at the intersection of computer vision and natural language processing.
Currently, most LVLMs struggle to selectively utilize retrieved information and are sensitive to irrelevant or misleading references.
We propose a self-refinement framework designed to teach LVLMs to Selectively Utilize Retrieved Information (SURf).
arXiv Detail & Related papers (2024-09-21T09:36:14Z)
- MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models [29.795942154703642]
We propose the multi-image relation association task and a meticulously curated Multi-granularity Multi-image Association benchmark.
Our experiments reveal that on the MMRA benchmark, current multi-image LVLMs exhibit distinct advantages and disadvantages across various subtasks.
Our findings indicate that while LVLMs demonstrate a strong capability to perceive image details, enhancing their ability to associate information across multiple images hinges on improving the reasoning capabilities of their language model component.
arXiv Detail & Related papers (2024-07-24T15:59:01Z)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs [88.28014831467503]
We introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset.
MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks.
We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly addresses this gap, generating longer and more accurate conversations.
arXiv Detail & Related papers (2024-06-17T17:59:47Z)
- NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks.
Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored.
We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short to comprehend context involving multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
- Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance [51.30560006045442]
Image-gRounded guIdaNcE (MARINE) is a framework that is both training-free and API-free. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs. Our framework's flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance.
arXiv Detail & Related papers (2024-02-13T18:59:05Z)
- Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs)
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z)
- MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z)