Resolving References in Visually-Grounded Dialogue via Text Generation
- URL: http://arxiv.org/abs/2309.13430v1
- Date: Sat, 23 Sep 2023 17:07:54 GMT
- Title: Resolving References in Visually-Grounded Dialogue via Text Generation
- Authors: Bram Willemsen, Livia Qian, Gabriel Skantze
- Abstract summary: Vision-language models (VLMs) have shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge.
We propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize coreferential information found in the linguistic context of references.
We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot.
- Score: 3.8673630752805446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) have shown to be effective at image retrieval
based on simple text queries, but text-image retrieval based on conversational
input remains a challenge. Consequently, if we want to use VLMs for reference
resolution in visually-grounded dialogue, the discourse processing capabilities
of these models need to be augmented. To address this issue, we propose
fine-tuning a causal large language model (LLM) to generate definite
descriptions that summarize coreferential information found in the linguistic
context of references. We then use a pretrained VLM to identify referents based
on the generated descriptions, zero-shot. We evaluate our approach on a
manually annotated dataset of visually-grounded dialogues and achieve results
that, on average, exceed the performance of the baselines we compare against.
Furthermore, we find that using referent descriptions based on larger context
windows has the potential to yield higher returns.
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models [15.907584884933414]
We introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context.
We propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities.
Experimental results demonstrate that ImageRef-VL not only outperforms proprietary models but also achieves an 88% performance improvement over state-of-the-art open-source VLMs.
arXiv Detail & Related papers (2025-01-20T13:43:45Z) - RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models [24.67117013862316]
Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding.
We introduce a referring remote sensing image segmentation foundational model, RSRefSeg.
Experimental results on the RRSIS-D dataset demonstrate that RSRefSeg outperforms existing methods.
arXiv Detail & Related papers (2025-01-12T13:22:35Z) - Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning [77.2852342808769]
In this paper, we introduce a detailed caption benchmark, termed as CompreCap, to evaluate the visual context from a directed scene graph view.
We first manually segment the image into semantically meaningful regions according to common-object vocabulary, while also distinguishing attributes of objects within all those regions.
Then directional relation labels of these objects are annotated to compose a directed scene graph that can well encode rich compositional information of the image.
arXiv Detail & Related papers (2024-12-11T18:37:42Z) - Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models [2.3301643766310374]
By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data.
We show the superior precision and recall performance of our image retrieval method compared to conventional vision-language model-based methods.
We also demonstrate that the retrieval performance can be improved by iteratively incorporating keywords into search queries.
arXiv Detail & Related papers (2024-08-29T06:54:03Z) - Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach [33.231639257323536]
In this paper, we address the issue of dialogue-form context query within the interactive text-to-image retrieval task.
By reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data.
We construct the LLM questioner to generate non-redundant questions about the attributes of the target image.
arXiv Detail & Related papers (2024-06-05T16:09:01Z) - Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models [17.171715290673678]
We propose an interactive image retrieval system capable of refining queries based on user relevance feedback.
This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries.
To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task.
arXiv Detail & Related papers (2024-04-29T14:46:35Z) - Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of generative large language models (LLM) generated context information.
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z) - Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z) - Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z) - Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.