Large Language Models and Multimodal Retrieval for Visual Word Sense
Disambiguation
- URL: http://arxiv.org/abs/2310.14025v1
- Date: Sat, 21 Oct 2023 14:35:42 GMT
- Title: Large Language Models and Multimodal Retrieval for Visual Word Sense
Disambiguation
- Authors: Anastasia Kritharoula, Maria Lymperaiou and Giorgos Stamou
- Abstract summary: Visual Word Sense Disambiguation (VWSD) is a novel and challenging task with the goal of retrieving, from a set of candidates, the image that best represents the meaning of an ambiguous word within a given context.
In this paper, we take a substantial step towards unveiling this interesting task by applying a varied set of approaches.
- Score: 1.8591405259852054
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Visual Word Sense Disambiguation (VWSD) is a novel and challenging task whose
goal is to retrieve, from a set of candidate images, the one that best represents the
meaning of an ambiguous word within a given context. In this paper, we take a substantial
step towards unveiling this interesting task by applying a varied set of approaches.
Since VWSD is primarily a text-image retrieval task, we explore the latest
transformer-based methods for multimodal retrieval. Additionally, we utilize Large
Language Models (LLMs) as knowledge bases to enhance the given phrases and resolve the
ambiguity related to the target word. We also study VWSD as a unimodal problem by
converting it to text-to-text and image-to-image retrieval, as well as to question
answering (QA), to fully explore the capabilities of the relevant models. To tap into the
implicit knowledge of LLMs, we experiment with Chain-of-Thought (CoT) prompting to guide
explainable answer generation. Finally, we train a learn-to-rank (LTR) model to combine
our different modules, achieving competitive ranking results. Extensive experiments on
VWSD provide valuable insights that can effectively drive future directions.
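Since VWSD is framed primarily as text-image retrieval, the snippet below is a minimal sketch of that core ranking step using a CLIP-style visiolinguistic transformer from Hugging Face transformers. The model checkpoint, image paths, and example phrase are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the core text-image retrieval step with a CLIP-style model.
# The checkpoint, image paths, and example phrase are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

phrase = "andromeda tree"  # ambiguous phrase; "andromeda" is the target word
candidates = [Image.open(p) for p in ["cand0.jpg", "cand1.jpg", "cand2.jpg"]]  # placeholder files

inputs = processor(text=[phrase], images=candidates, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image has shape (num_candidates, 1): similarity of each image to the phrase.
scores = out.logits_per_image.squeeze(-1)
ranking = torch.argsort(scores, descending=True)
print("Ranked candidate indices:", ranking.tolist())
```

In the paper, scores produced by several such modules (retrieval, LLM-enhanced retrieval, QA, and so on) are then combined with the learn-to-rank model mentioned above; that combination step is not shown here.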
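The abstract also mentions reformulating VWSD as question answering guided by Chain-of-Thought (CoT) prompting. The template below is a hypothetical illustration of such a prompt, assuming the candidate images have first been described by captions; it is not the paper's actual prompt.

```python
# Hypothetical CoT-style prompt for a QA reformulation of VWSD.
# The wording and the captioned candidates are illustrative assumptions.
phrase, target = "bank erosion", "bank"
candidate_captions = [
    "a river bank collapsing into the water",
    "the facade of a commercial bank building",
    "rows of seats on the upper bank of a stadium",
]

options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(candidate_captions))
prompt = (
    f"Which image best depicts the phrase '{phrase}', where the ambiguous word is '{target}'?\n"
    f"{options}\n"
    "Let's think step by step, then answer with a single letter."
)
print(prompt)  # send this to an LLM and parse the final letter to select a candidate
```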
Related papers
- ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers [1.6541870997607049]
We present ARPA, an architecture that fuses the unparalleled contextual understanding of large language models with the advanced feature extraction capabilities of transformers.
ARPA's introduction marks a significant milestone in visual word disambiguation, offering a compelling solution.
We invite researchers and practitioners to explore the capabilities of our model, envisioning a future where such hybrid models drive unprecedented advancements in artificial intelligence.
arXiv Detail & Related papers (2024-08-12T10:15:13Z) - VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text in order to answer questions accurately.
In support of this task, we craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models [17.171715290673678]
We propose an interactive image retrieval system capable of refining queries based on user relevance feedback.
This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries.
To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task.
arXiv Detail & Related papers (2024-04-29T14:46:35Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z) - HKUST at SemEval-2023 Task 1: Visual Word Sense Disambiguation with
Context Augmentation and Visual Assistance [5.5532783549057845]
We propose a multi-modal retrieval framework that maximally leverages pretrained Vision-Language models.
Although our system does not produce the most competitive results at SemEval-2023 Task 1, it still outperforms nearly half of the participating teams.
arXiv Detail & Related papers (2023-11-30T06:23:15Z) - Language Models as Knowledge Bases for Visual Word Sense Disambiguation [1.8591405259852054]
We propose some knowledge-enhancement techniques towards improving the retrieval performance of visiolinguistic (VL) transformers.
More specifically, knowledge stored in Large Language Models (LLMs) is retrieved with the help of appropriate prompts in a zero-shot manner (a minimal sketch of this step is given after this list).
Our approach is the first to analyze the merits of exploiting knowledge stored in LLMs in different ways to solve Visual Word Sense Disambiguation.
arXiv Detail & Related papers (2023-10-03T11:11:55Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z) - OPI at SemEval 2023 Task 1: Image-Text Embeddings and Multimodal
Information Retrieval for Visual Word Sense Disambiguation [0.0]
We present our submission to the SemEval 2023 visual word sense disambiguation shared task.
The proposed system integrates multimodal embeddings, learning to rank methods, and knowledge-based approaches.
Our solution was ranked third in the multilingual task and won in the Persian track, one of the three language subtasks.
arXiv Detail & Related papers (2023-04-14T13:45:59Z) - DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separate attention spaces for vision and language (a conceptual sketch of this idea follows after this list).
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
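Following up on the "Language Models as Knowledge Bases for Visual Word Sense Disambiguation" entry above, the sketch below illustrates eliciting sense knowledge from an LLM with a zero-shot prompt and appending it to the phrase before visiolinguistic retrieval. The pipeline checkpoint and prompt wording are placeholder assumptions, not the paper's exact setup.

```python
# Hedged sketch: retrieving sense knowledge from an LLM with a zero-shot prompt.
# The checkpoint and prompt wording are assumptions; the paper's prompts may differ.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder; an instruction-tuned LLM is preferable

phrase, target = "angora wool", "angora"
prompt = f"In the phrase '{phrase}', the word '{target}' refers to"
completion = generator(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"]

# Append the elicited knowledge to the phrase before scoring the candidate images
# with a visiolinguistic transformer such as CLIP.
enriched_query = f"{phrase}. {completion[len(prompt):].strip()}"
print(enriched_query)
```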
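For the DiMBERT entry above, the listing only states that attention is applied in separate spaces for vision and language. The PyTorch module below is a conceptual sketch of that disentangling idea under assumed dimensions; it is not the actual DiMBERT architecture.

```python
# Conceptual sketch of disentangled multimodal attention: each token attends to the
# visual and linguistic token sets in separate attention spaces, and the two views
# are merged afterwards. Dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class DisentangledAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn_vision = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_language = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, tokens, vision_feats, language_feats):
        v, _ = self.attn_vision(tokens, vision_feats, vision_feats)
        l, _ = self.attn_language(tokens, language_feats, language_feats)
        return self.merge(torch.cat([v, l], dim=-1))

layer = DisentangledAttention()
out = layer(torch.randn(2, 10, 256), torch.randn(2, 36, 256), torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 10, 256])
```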
This list is automatically generated from the titles and abstracts of the papers on this site.