Language Models as Knowledge Bases for Visual Word Sense Disambiguation
- URL: http://arxiv.org/abs/2310.01960v1
- Date: Tue, 3 Oct 2023 11:11:55 GMT
- Title: Language Models as Knowledge Bases for Visual Word Sense Disambiguation
- Authors: Anastasia Kritharoula, Maria Lymperaiou, Giorgos Stamou
- Abstract summary: We propose some knowledge-enhancement techniques towards improving the retrieval performance of visiolinguistic (VL) transformers.
More specifically, knowledge stored in Large Language Models (LLMs) is retrieved with the help of appropriate prompts in a zero-shot manner.
Our approach is the first to analyze the merits of exploiting knowledge stored in LLMs in different ways to solve Visual Word Sense Disambiguation.
- Score: 1.8591405259852054
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Visual Word Sense Disambiguation (VWSD) is a novel and challenging task that
lies between linguistic sense disambiguation and fine-grained multimodal retrieval.
Recent advancements in visiolinguistic (VL) transformers have yielded
off-the-shelf implementations with encouraging results, which, however, we
argue can be further improved. To this end, we propose knowledge-enhancement
techniques that improve the retrieval performance of VL transformers by using
Large Language Models (LLMs) as Knowledge
Bases. More specifically, knowledge stored in LLMs is retrieved with the help
of appropriate prompts in a zero-shot manner, achieving performance gains.
Moreover, we convert VWSD to a purely textual question-answering
(QA) problem by considering generated image captions as multiple-choice
candidate answers. Zero-shot and few-shot prompting strategies are leveraged to
explore the potential of such a transformation, while Chain-of-Thought (CoT)
prompting in the zero-shot setting is able to reveal the internal reasoning
steps an LLM follows to select the appropriate candidate. Overall, our
approach is the first to analyze the merits of exploiting knowledge stored in
LLMs in different ways to solve VWSD.
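To make the abstract's pipeline concrete, below is a minimal sketch, in Python, of how the two ideas could be wired together: a zero-shot prompt that treats the LLM as a knowledge base for enriching the ambiguous phrase, and the conversion of VWSD into a textual multiple-choice QA problem over generated image captions (with an optional Chain-of-Thought variant). The prompt wording, the `llm` callable, and the example captions are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (not the authors' implementation) of the two ideas in the
# abstract: (1) querying an LLM as a knowledge base to enrich the ambiguous
# phrase, and (2) recasting VWSD as a purely textual multiple-choice QA problem
# over generated image captions. The `llm` callable, prompt wording, and example
# captions are illustrative assumptions.

from typing import Callable, List
import string


def build_enrichment_prompt(target_word: str, context_phrase: str) -> str:
    """Zero-shot prompt asking the LLM for sense-clarifying knowledge, which
    could then be appended to the text query of a VL retriever."""
    return (
        f'In the phrase "{context_phrase}", what does the word '
        f'"{target_word}" refer to? Answer with a short description.'
    )


def build_qa_prompt(target_word: str, context_phrase: str,
                    captions: List[str], chain_of_thought: bool = False) -> str:
    """Multiple-choice prompt in which candidate images appear as captions."""
    letters = string.ascii_uppercase
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(captions))
    prompt = (
        f'The ambiguous word "{target_word}" appears in the phrase '
        f'"{context_phrase}". Which image description best matches the intended '
        f"sense of the word?\n{options}\n"
    )
    if chain_of_thought:
        prompt += "Let's think step by step, then answer with a single letter."
    else:
        prompt += "Answer with a single letter."
    return prompt


def pick_candidate(answer: str, num_candidates: int) -> int:
    """Return the index of the first option letter found in the LLM's answer."""
    letters = string.ascii_uppercase[:num_candidates]
    for ch in answer.upper():
        if ch in letters:
            return letters.index(ch)
    return -1  # the answer contained no parsable choice


def disambiguate(llm: Callable[[str], str], target_word: str,
                 context_phrase: str, captions: List[str]) -> int:
    """End-to-end zero-shot VWSD-as-QA: build the prompt, query the LLM,
    and map its answer back to a candidate image index."""
    answer = llm(build_qa_prompt(target_word, context_phrase, captions))
    return pick_candidate(answer, len(captions))


if __name__ == "__main__":
    captions = ["a bat hanging upside down in a tree",
                "a wooden cricket bat lying on the grass"]
    fake_llm = lambda prompt: "A"  # dummy stand-in so the sketch runs end to end
    print(build_enrichment_prompt("bat", "fruit bat"))
    print(disambiguate(fake_llm, "bat", "fruit bat", captions))  # -> 0
```

In a real setup, `llm` would wrap an actual language model and the enrichment output would be concatenated with the text query fed to the VL retriever; here a dummy callable stands in so the sketch is runnable.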
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models [26.964848679914354]
CoKnow is a framework that enhances Prompt Learning for Vision-Language Models with rich contextual knowledge.
We conducted extensive experiments on 11 publicly available datasets, demonstrating that CoKnow outperforms a series of previous methods.
arXiv Detail & Related papers (2024-04-16T07:44:52Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z) - Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs).
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z) - How to Configure Good In-Context Sequence for Visual Question Answering [19.84012680826303]
In this study, we use Visual Question Answering (VQA) as a case study to explore diverse in-context configurations.
Specifically, to explore in-context configurations, we design diverse retrieval methods and employ different strategies to manipulate the retrieved demonstrations.
We uncover three important inner properties of the applied LVLM and demonstrate which strategies can consistently improve the ICL VQA performance.
arXiv Detail & Related papers (2023-12-04T02:03:23Z) - Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions [15.262736501208467]
Large Language Models (LLMs) demonstrate impressive reasoning ability and maintain extensive world knowledge.
Since images are invisible to LLMs, researchers convert images to text to engage LLMs in the visual question reasoning procedure.
We design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image.
arXiv Detail & Related papers (2023-11-20T08:23:39Z) - Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation [1.8591405259852054]
Visual Word Sense Disambiguation (VWSD) is a novel and challenging task with the goal of retrieving an image from a set of candidates.
In this paper, we take a substantial step towards unveiling this interesting task by applying a varying set of approaches.
arXiv Detail & Related papers (2023-10-21T14:35:42Z) - Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue.
By deviating from natural language, CIPHER offers the advantage of encoding a broader spectrum of information without any modification to the model weights.
This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
arXiv Detail & Related papers (2023-10-10T03:06:38Z) - LMMS Reloaded: Transformer-based Sense Embeddings for Disambiguation and Beyond [2.9005223064604078]
Recent Transformer-based Neural Language Models (NLMs) have proven capable of producing contextual word representations that reliably convey sense-specific information.
We introduce a more principled approach to leveraging information from all layers of NLMs, informed by a probing analysis of 14 NLM variants.
We also emphasize the versatility of these sense embeddings, in contrast to task-specific models, applying them to several sense-related tasks besides WSD.
arXiv Detail & Related papers (2021-05-26T10:14:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.