Toward Automatic Relevance Judgment using Vision--Language Models for Image--Text Retrieval Evaluation
- URL: http://arxiv.org/abs/2408.01363v1
- Date: Fri, 2 Aug 2024 16:15:25 GMT
- Title: Toward Automatic Relevance Judgment using Vision--Language Models for Image--Text Retrieval Evaluation
- Authors: Jheng-Hong Yang, Jimmy Lin,
- Abstract summary: Vision--Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain.
This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V, within a large-scale textitad hoc retrieval task tailored for multimedia content creation in a zero-shot fashion.
- Score: 56.49084589053732
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision--Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain. This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V, within a large-scale \textit{ad hoc} retrieval task tailored for multimedia content creation in a zero-shot fashion. Preliminary experiments reveal the following: (1) Both LLaVA and GPT-4V, encompassing open-source and closed-source visual-instruction-tuned Large Language Models (LLMs), achieve notable Kendall's $\tau \sim 0.4$ when compared to human relevance judgments, surpassing the CLIPScore metric. (2) While CLIPScore is strongly preferred, LLMs are less biased towards CLIP-based retrieval systems. (3) GPT-4V's score distribution aligns more closely with human judgments than other models, achieving a Cohen's $\kappa$ value of around 0.08, which outperforms CLIPScore at approximately -0.096. These findings underscore the potential of LLM-powered VLMs in enhancing relevance judgments.
Related papers
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM)
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge [15.980606104936365]
Large Language Models (LLMs) have revolutionized the landscape of machine learning, yet current benchmarks often fall short in capturing the diverse behavior of these models in real-world applications.
Existing frameworks like Alpaca-Eval 2.0 LC citedubois2024lengthcontrolledalpacaevalsimpleway and Arena-Hard v0.1 citeli2024crowdsourced are limited by their focus on general-purpose queries and lack of diversity across domains such as law, medicine, and multilingual contexts.
We introduce a novel data pipeline that curates, domain-specific evaluation sets tailored for LLM-as
arXiv Detail & Related papers (2024-08-16T15:41:43Z) - Large Language Models as Reliable Knowledge Bases? [60.25969380388974]
Large Language Models (LLMs) can be viewed as potential knowledge bases (KBs)
This study defines criteria that a reliable LLM-as-KB should meet, focusing on factuality and consistency.
strategies like ICL and fine-tuning are unsuccessful at making LLMs better KBs.
arXiv Detail & Related papers (2024-07-18T15:20:18Z) - RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [94.03511733306296]
We introduce RLAIF-V, a framework that aligns MLLMs in a fully open-source paradigm for super GPT-4V trustworthiness.
RLAIF-V maximally exploits the open-source feedback from two perspectives, including high-quality feedback data and online feedback learning algorithm.
Experiments show that RLAIF-V substantially enhances the trustworthiness of models without sacrificing performance on other tasks.
arXiv Detail & Related papers (2024-05-27T14:37:01Z) - Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset [7.954348293179786]
We propose CFLUE, a benchmark to assess the capability of large language models (LLMs) across various dimensions.
In knowledge assessment, it consists of 38K+ multiple-choice questions with associated solution explanations.
In application assessment, it features 16K+ test instances across distinct groups of NLP tasks such as text classification, machine translation, relation extraction, reading comprehension, and text generation.
arXiv Detail & Related papers (2024-05-17T05:03:40Z) - Language Models can Evaluate Themselves via Probability Discrepancy [38.54454263880133]
We propose a new self-evaluation method ProbDiff for assessing the efficacy of various Large Language Models (LLMs)
It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions.
Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4.
arXiv Detail & Related papers (2024-05-17T03:50:28Z) - SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z) - Split and Merge: Aligning Position Biases in Large Language Model based
Evaluators [23.38206418382832]
PORTIA is an alignment-based system designed to mimic human comparison strategies to calibrate position bias.
Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested.
It rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%.
arXiv Detail & Related papers (2023-09-29T14:38:58Z) - KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA)
We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering $19$ tasks.
We use both Wikipedia, a corpus prevalently pre-trained by LLMs, along with continuously collected emerging corpora, to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.