Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation
- URL: http://arxiv.org/abs/2310.14025v1
- Date: Sat, 21 Oct 2023 14:35:42 GMT
- Title: Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation
- Authors: Anastasia Kritharoula, Maria Lymperaiou and Giorgos Stamou
- Abstract summary: Visual Word Sense Disambiguation (VWSD) is a novel and challenging task whose goal is to retrieve, among a set of candidates, the image that best represents the meaning of an ambiguous word in a given context.
In this paper, we take a substantial step towards unveiling this interesting task by applying a varied set of approaches.
- Score: 1.8591405259852054
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Visual Word Sense Disambiguation (VWSD) is a novel and challenging
task whose goal is to retrieve, among a set of candidates, the image that best
represents the meaning of an ambiguous word within a given context. In this
paper, we take a substantial step towards unveiling this interesting task by
applying a varied set of approaches. Since VWSD is primarily a text-image
retrieval task, we explore the latest transformer-based methods for multimodal
retrieval. Additionally, we utilize Large Language Models (LLMs) as knowledge
bases to enhance the given phrases and resolve the ambiguity of the target
word. We also study VWSD as a unimodal problem by converting it to text-to-text
and image-to-image retrieval, as well as to question answering (QA), to fully
explore the capabilities of the relevant models. To tap into the implicit
knowledge of LLMs, we experiment with Chain-of-Thought (CoT) prompting to guide
explainable answer generation. Finally, we train a learn-to-rank (LTR) model to
combine our different modules, achieving competitive ranking results. Extensive
experiments on VWSD provide valuable insights that can effectively drive future
directions.
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text in order to answer questions accurately.
In support of this task, we craft a new VEGA dataset tailored for the IITC task on scientific content and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models [17.171715290673678]
We propose an interactive image retrieval system capable of refining queries based on user relevance feedback.
This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries.
To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task.
arXiv Detail & Related papers (2024-04-29T14:46:35Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- HKUST at SemEval-2023 Task 1: Visual Word Sense Disambiguation with Context Augmentation and Visual Assistance [5.5532783549057845]
We propose a multi-modal retrieval framework that maximally leverages pretrained Vision-Language models.
Our system does not produce the most competitive results at SemEval-2023 Task 1, but we are still able to beat nearly half of the teams.
arXiv Detail & Related papers (2023-11-30T06:23:15Z)
- Language Models as Knowledge Bases for Visual Word Sense Disambiguation [1.8591405259852054]
We propose some knowledge-enhancement techniques towards improving the retrieval performance of visiolinguistic (VL) transformers.
More specifically, knowledge stored in Large Language Models (LLMs) is retrieved with the help of appropriate prompts in a zero-shot manner.
Our approach is the first to analyze the merits of exploiting knowledge stored in LLMs in different ways to solve Visual Word Sense Disambiguation.
arXiv Detail & Related papers (2023-10-03T11:11:55Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z)
- OPI at SemEval 2023 Task 1: Image-Text Embeddings and Multimodal Information Retrieval for Visual Word Sense Disambiguation [0.0]
We present our submission to SemEval 2023 visual word sense disambiguation shared task.
The proposed system integrates multimodal embeddings, learning to rank methods, and knowledge-based approaches.
Our solution was ranked third in the multilingual task and won in the Persian track, one of the three language subtasks.
arXiv Detail & Related papers (2023-04-14T13:45:59Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.