Large Language Models as Automated Aligners for benchmarking
Vision-Language Models
- URL: http://arxiv.org/abs/2311.14580v1
- Date: Fri, 24 Nov 2023 16:12:05 GMT
- Title: Large Language Models as Automated Aligners for benchmarking
Vision-Language Models
- Authors: Yuanfeng Ji, Chongjian Ge, Weikai Kong, Enze Xie, Zhengying Liu,
Zhengguo Li, Ping Luo
- Abstract summary: Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks.
Existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence.
In this work, we address these limitations via Auto-Bench, which explores LLMs as proficient aligners, measuring the alignment between VLMs and human intelligence and values through automatic data curation and assessment.
- Score: 48.4367174400306
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: With the advancements in Large Language Models (LLMs), Vision-Language Models
(VLMs) have reached a new level of sophistication, showing notable competence
in executing intricate cognition and reasoning tasks. However, existing
evaluation benchmarks, primarily relying on rigid, hand-crafted datasets to
measure task-specific performance, face significant limitations in assessing
the alignment of these increasingly anthropomorphic models with human
intelligence. In this work, we address these limitations via Auto-Bench, which
explores LLMs as proficient aligners, measuring the alignment between VLMs and
human intelligence and values through automatic data curation
and assessment. Specifically, for data curation, Auto-Bench utilizes LLMs
(e.g., GPT-4) to automatically generate a vast set of question-answer-reasoning
triplets via prompting on visual symbolic representations (e.g., captions,
object locations, instance relationships, etc.). The curated data closely
matches human intent, owing to the extensive world knowledge embedded in LLMs.
Through this pipeline, a total of 28.5K human-verified and 3,504K unfiltered
question-answer-reasoning triplets have been curated, covering 4 primary
abilities and 16 sub-abilities. We subsequently engage LLMs like GPT-3.5 to
serve as judges, implementing quantitative and qualitative automated
assessments to facilitate a comprehensive evaluation of VLMs. Our validation
results reveal that LLMs are proficient in both evaluation data curation and
model assessment, achieving an average agreement rate of 85%. We envision
Auto-Bench as a flexible, scalable, and comprehensive benchmark for evaluating
evolving, increasingly sophisticated VLMs.
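The abstract describes a two-stage pipeline: a strong LLM (e.g., GPT-4) is prompted with visual symbolic representations to curate question-answer-reasoning triplets, and a second LLM (e.g., GPT-3.5) then serves as a judge over VLM answers. The snippet below is a minimal sketch of that idea only, not the authors' released code; it assumes the openai Python client (>= 1.0), and the model names, prompt wording, and SymbolicImage fields are illustrative assumptions.

```python
# Minimal sketch of the Auto-Bench pipeline described above; NOT the authors' code.
# Assumes the openai Python client (>= 1.0). Model names, prompt wording, and the
# SymbolicImage fields are illustrative assumptions.
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


@dataclass
class SymbolicImage:
    """Visual symbolic representation of one image (caption, boxes, relations)."""
    caption: str
    object_locations: str        # e.g. "dog at (120, 80, 310, 400); ball at (400, 350, 460, 410)"
    instance_relationships: str  # e.g. "the dog is chasing the ball"


def curate_triplet(img: SymbolicImage, ability: str) -> str:
    """Prompt a strong LLM (e.g. GPT-4) to write one question-answer-reasoning
    triplet grounded in the symbolic description of the image."""
    prompt = (
        f"Image caption: {img.caption}\n"
        f"Object locations: {img.object_locations}\n"
        f"Relationships: {img.instance_relationships}\n\n"
        f"Write one {ability} question about this image, its answer, and a short "
        "reasoning chain.\nFormat:\nQuestion: ...\nAnswer: ...\nReasoning: ..."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def judge_vlm_answer(question: str, reference_answer: str, vlm_answer: str) -> str:
    """Use a cheaper LLM (e.g. GPT-3.5) as a judge that scores a VLM's answer
    against the curated reference answer."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Model answer: {vlm_answer}\n\n"
        "Rate the model answer from 1 to 5 for correctness and briefly justify the score."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

In a full pipeline of this kind, curate_triplet would be looped over a corpus of symbolic image descriptions, the candidate triplets would be human-verified, the VLM under test would answer the retained questions, and the judge's scores would be aggregated per ability.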
Related papers
- 3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset [13.808860456901204]
We introduce a scalable 3D benchmark, accompanied by a large-scale instruction-tuning dataset known as 3DBench.
Specifically, we establish the benchmark that spans a wide range of spatial and semantic scales, from object-level to scene-level.
We present a rigorous pipeline for automatically constructing scalable 3D instruction-tuning datasets, covering 10 diverse multi-modal tasks with more than 0.23 million QA pairs generated in total.
arXiv Detail & Related papers (2024-04-23T02:06:10Z)
- Assessment of Multimodal Large Language Models in Alignment with Human Values [43.023052912326314]
We introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations.
The Ch3Ef dataset contains 1002 human-annotated data samples, covering 12 domains and 46 tasks based on the hhh (helpful, honest, harmless) principle.
arXiv Detail & Related papers (2024-03-26T16:10:21Z)
- The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate [17.77014177096838]
This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators.
We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA dataset.
arXiv Detail & Related papers (2024-02-09T06:16:08Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [44.401826163314716]
We propose a new evaluation paradigm for MLLMs that uses a potent MLLM as the judge.
We benchmark 21 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models.
The validity of our benchmark manifests itself in reaching 88.02% agreement with human evaluation.
arXiv Detail & Related papers (2023-11-23T12:04:25Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision [85.6008224440157]
Multi-modality Large Language Models (MLLMs) have catalyzed a shift in computer vision from specialized models to general-purpose foundation models.
We present Q-Bench, a holistic benchmark crafted to evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.
arXiv Detail & Related papers (2023-09-25T14:43:43Z)
- KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA).
We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks.
We use Wikipedia, a corpus on which LLMs are prevalently pre-trained, along with continuously collected emerging corpora, to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)