ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models
- URL: http://arxiv.org/abs/2501.12418v1
- Date: Mon, 20 Jan 2025 13:43:45 GMT
- Title: ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models
- Authors: Jingwei Yi, Junhao Yin, Ju Xu, Peng Bao, Yongliang Wang, Wei Fan, Hao Wang
- Abstract summary: We introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context.
We propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities.
Experimental results demonstrate that ImageRef-VL not only outperforms proprietary models but also achieves an 88% performance improvement over state-of-the-art open-source VLMs.
- Score: 15.907584884933414
- Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding multimodal inputs and have been widely integrated into Retrieval-Augmented Generation (RAG) based conversational systems. While current VLM-powered chatbots can provide textual source references in their responses, they exhibit significant limitations in referencing contextually relevant images during conversations. In this paper, we introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context -- and systematically investigate VLMs' capability in this aspect. We conduct the first evaluation for contextual image referencing, comprising a dedicated testing dataset and evaluation metrics. Furthermore, we propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities through instruction fine-tuning on a large-scale, manually curated multimodal conversation dataset. Experimental results demonstrate that ImageRef-VL not only outperforms proprietary models but also achieves an 88% performance improvement over state-of-the-art open-source VLMs in contextual image referencing tasks. Our code is available at https://github.com/bytedance/ImageRef-VL.
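To make the task concrete, below is a minimal sketch of how a RAG pipeline might expose retrieved images to a VLM through placeholder tags and then resolve whichever tags the model cites. The tag scheme and function names here are illustrative assumptions, not the paper's implementation (see the linked repository for that).

```python
# Hypothetical sketch of contextual image referencing in a RAG loop.
# The placeholder-tag scheme and function names are illustrative only.

def build_prompt(question: str, docs: list[dict]) -> tuple[str, dict[str, str]]:
    """Interleave retrieved text with image placeholder tags the VLM can cite."""
    image_index: dict[str, str] = {}
    chunks = []
    for d_i, doc in enumerate(docs):
        chunks.append(f"[Doc {d_i}] {doc['text']}")
        for i_i, url in enumerate(doc.get("images", [])):
            tag = f"<image_{d_i}_{i_i}>"
            image_index[tag] = url
            chunks.append(f"{tag} (figure from Doc {d_i})")
    prompt = (
        "\n".join(chunks)
        + f"\n\nQuestion: {question}\n"
        + "Answer the question, inserting the tag of any image that supports it."
    )
    return prompt, image_index

def resolve_image_tags(answer: str, image_index: dict[str, str]) -> str:
    """Replace cited placeholder tags with renderable markdown image links."""
    for tag, url in image_index.items():
        answer = answer.replace(tag, f"![referenced image]({url})")
    return answer
```

Under a scheme like this, the model's response can cite an image inline at exactly the point in the text where it is relevant, which is the behavior the paper evaluates.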
Related papers
- VP-MEL: Visual Prompts Guided Multimodal Entity Linking [16.463229055333407]
Multimodal entity linking (MEL) is a task aimed at linking mentions within multimodal contexts to their corresponding entities in a knowledge base (KB).
Existing MEL methods often rely on mention words as retrieval cues, which limits their ability to effectively utilize information from both images and text.
We propose a framework named IIER, which enhances visual feature extraction using visual prompts and leverages the pretrained Detective-VLM model to capture latent information.
arXiv Detail & Related papers (2024-12-09T18:06:39Z)
- FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings.
Our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z)
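One way to picture "localized image embeddings" is text-conditioned attention pooling over patch features. The sketch below is a generic illustration of that idea, not FLAIR's actual architecture.

```python
# Generic text-conditioned attention pooling: a caption embedding selects
# which image patches contribute to a localized image embedding.
# A sketch of the idea only, not FLAIR's exact formulation.
import torch
import torch.nn.functional as F

def localized_embedding(patch_feats: torch.Tensor,  # (num_patches, dim)
                        text_feat: torch.Tensor     # (dim,)
                        ) -> torch.Tensor:
    """Pool patch features, weighted by similarity to the text query."""
    scale = patch_feats.size(-1) ** 0.5
    weights = F.softmax(patch_feats @ text_feat / scale, dim=0)  # (num_patches,)
    return weights @ patch_feats                                 # (dim,)
```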
- Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models [10.41857522464292]
We introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark to assess the long-context capabilities of MLLMs.
We employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval.
We evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models.
arXiv Detail & Related papers (2024-06-17T05:54:06Z)
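A rough sketch of the image-stitching step described above; the grid layout and tile size here are assumptions, not the benchmark's actual settings.

```python
# Illustrative sketch of image stitching for long-context evaluation:
# pack many sub-images into one large grid image.
from PIL import Image

def stitch_grid(images: list[Image.Image], rows: int, cols: int,
                tile: int = 224) -> Image.Image:
    """Stitch sub-images into a rows x cols grid, one tile per sub-image."""
    canvas = Image.new("RGB", (cols * tile, rows * tile))
    for idx, img in enumerate(images[: rows * cols]):
        r, c = divmod(idx, cols)
        canvas.paste(img.resize((tile, tile)), (c * tile, r * tile))
    return canvas

# A sub-image label can then be generated automatically from its grid cell,
# e.g. "the needle is at row r, column c" for retrieval-style questions.
```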
- Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models [17.171715290673678]
We propose an interactive image retrieval system capable of refining queries based on user relevance feedback.
This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries.
To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task.
arXiv Detail & Related papers (2024-04-29T14:46:35Z)
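Schematically, the refinement loop might look like the following, with the LLM rewriter, the retrieval backend, and the user-feedback step abstracted as callables. All names are hypothetical.

```python
# Schematic relevance-feedback loop for query rewriting; search(), feedback()
# and rewrite() stand in for a retrieval backend, a user, and an LLM.
from typing import Callable

def interactive_retrieval(
    query: str,
    search: Callable[[str], list[str]],          # retrieval backend
    feedback: Callable[[list[str]], list[str]],  # user marks relevant hits
    rewrite: Callable[[str, list[str]], str],    # LLM query rewriter
    rounds: int = 3,
) -> list[str]:
    """Refine a text query over several relevance-feedback rounds."""
    results: list[str] = []
    for _ in range(rounds):
        results = search(query)
        liked = feedback(results)
        if liked:
            query = rewrite(query, liked)
    return results
```

The summary above mentions a VLM-based captioner that enhances the text queries; in this sketch, that step would live inside the rewrite callable.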
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
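The key mechanic, attending to a region of interest without downsampling the whole image, can be approximated by cropping at native resolution. The normalized box format below is an assumption for illustration, not CoS's actual interaction protocol.

```python
# Sketch: crop a region of interest at full resolution so a second
# reasoning pass sees the key region in detail.
from PIL import Image

def crop_region_of_interest(image: Image.Image,
                            box: tuple[float, float, float, float]
                            ) -> Image.Image:
    """Crop a normalized (x0, y0, x1, y1) box from the original image."""
    w, h = image.size
    x0, y0, x1, y1 = box
    return image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
```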
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
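The summary above does not spell out DoCo's alignment objective; a generic symmetric InfoNCE loss between document-object features and vision-encoder features, as sketched below, conveys the general shape of such contrastive alignment without claiming to be DoCo's exact formulation.

```python
# Generic symmetric InfoNCE alignment between two paired feature sets:
# row i of doc_feats is the positive match for row i of vis_feats.
import torch
import torch.nn.functional as F

def contrastive_align(doc_feats: torch.Tensor,   # (N, dim)
                      vis_feats: torch.Tensor,   # (N, dim)
                      temperature: float = 0.07) -> torch.Tensor:
    """Pull matching features together, push mismatched pairs apart."""
    doc = F.normalize(doc_feats, dim=-1)
    vis = F.normalize(vis_feats, dim=-1)
    logits = doc @ vis.t() / temperature           # (N, N) similarity matrix
    targets = torch.arange(doc.size(0), device=doc.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```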
- Localized Symbolic Knowledge Distillation for Visual Commonsense Models [150.18129140140238]
We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input.
We train our model by sampling localized commonsense knowledge from a large language model.
We find that training on the localized commonsense corpus can successfully distill existing vision-language models to support a reference-as-input interface.
arXiv Detail & Related papers (2023-12-08T05:23:50Z)
- Resolving References in Visually-Grounded Dialogue via Text Generation [3.8673630752805446]
Vision-language models (VLMs) have been shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge.
We propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize coreferential information found in the linguistic context of references.
We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot.
arXiv Detail & Related papers (2023-09-23T17:07:54Z)
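The second, zero-shot stage can be sketched with an off-the-shelf CLIP model standing in for the pretrained VLM; the paper's actual model choice may differ.

```python
# Sketch of the two-stage idea: an LLM first turns dialogue context into a
# definite description, then a pretrained VLM (CLIP here, as a stand-in)
# scores candidate images zero-shot.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def resolve_referent(description: str, candidates: list[Image.Image]) -> int:
    """Return the index of the candidate image best matching the description."""
    inputs = processor(text=[description], images=candidates,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_text  # (1, num_images)
    return int(logits.argmax(dim=-1).item())
```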
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT), which extends the functionality of Multimodal Large Language Models (MLLMs) by integrating an additional region-level vision encoder.
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)