Large Language Models as Automated Aligners for benchmarking
Vision-Language Models
- URL: http://arxiv.org/abs/2311.14580v1
- Date: Fri, 24 Nov 2023 16:12:05 GMT
- Title: Large Language Models as Automated Aligners for benchmarking
Vision-Language Models
- Authors: Yuanfeng Ji, Chongjian Ge, Weikai Kong, Enze Xie, Zhengying Liu,
Zhengguo Li, Ping Luo
- Abstract summary: Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks.
Existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence.
In this work, we address these limitations via Auto-Bench, which explores LLMs as proficient aligners, measuring the alignment between VLMs and human intelligence and value through automatic data curation and assessment.
- Score: 48.4367174400306
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: With the advancements in Large Language Models (LLMs), Vision-Language Models
(VLMs) have reached a new level of sophistication, showing notable competence
in executing intricate cognition and reasoning tasks. However, existing
evaluation benchmarks, primarily relying on rigid, hand-crafted datasets to
measure task-specific performance, face significant limitations in assessing
the alignment of these increasingly anthropomorphic models with human
intelligence. In this work, we address the limitations via Auto-Bench, which
delves into exploring LLMs as proficient aligners, measuring the alignment
between VLMs and human intelligence and value through automatic data curation
and assessment. Specifically, for data curation, Auto-Bench utilizes LLMs
(e.g., GPT-4) to automatically generate a vast set of question-answer-reasoning
triplets via prompting on visual symbolic representations (e.g., captions,
object locations, instance relationships, etc.). The curated data closely
matches human intent, owing to the extensive world knowledge embedded in LLMs.
Through this pipeline, a total of 28.5K human-verified and 3,504K unfiltered
question-answer-reasoning triplets have been curated, covering 4 primary
abilities and 16 sub-abilities. We subsequently engage LLMs like GPT-3.5 to
serve as judges, implementing the quantitative and qualitative automated
assessments to facilitate a comprehensive evaluation of VLMs. Our validation
results reveal that LLMs are proficient in both evaluation data curation and
model assessment, achieving an average agreement rate of 85%. We envision
Auto-Bench as a flexible, scalable, and comprehensive benchmark for evaluating
the evolving sophisticated VLMs.
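The data curation pipeline described above reduces to a simple loop: prompt an LLM (e.g., GPT-4) with the symbolic annotations of an image and parse question-answer-reasoning triplets from its reply. The sketch below is a minimal illustration under assumed interfaces; the prompt wording, the JSON output schema, the SymbolicImage container, and the call_llm helper are hypothetical and are not the authors' released code.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SymbolicImage:
    """Symbolic visual representation of one image (no pixels required)."""
    caption: str
    objects: List[dict]    # hypothetical format, e.g. {"name": "dog", "box": [x1, y1, x2, y2]}
    relations: List[str]   # e.g. "the dog is left of the bicycle"


# Hypothetical prompt template; the actual Auto-Bench prompts are not reproduced here.
CURATION_PROMPT = """You are given symbolic annotations of an image.
Caption: {caption}
Objects: {objects}
Relations: {relations}

Write {n} question-answer-reasoning triplets that test {ability}.
Return a JSON list of objects with keys "question", "answer", "reasoning"."""


def curate_triplets(image: SymbolicImage,
                    ability: str,
                    call_llm: Callable[[str], str],
                    n: int = 3) -> List[dict]:
    """Ask an LLM for question-answer-reasoning triplets about one image."""
    prompt = CURATION_PROMPT.format(
        caption=image.caption,
        objects=json.dumps(image.objects),
        relations="; ".join(image.relations),
        n=n,
        ability=ability,
    )
    raw = call_llm(prompt)  # call_llm is an assumed text-in/text-out wrapper around the LLM API
    try:
        triplets = json.loads(raw)
    except json.JSONDecodeError:
        return []  # discard malformed generations; human verification happens downstream
    if not isinstance(triplets, list):
        return []
    return [t for t in triplets
            if isinstance(t, dict) and {"question", "answer", "reasoning"} <= t.keys()]
```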
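The assessment stage can be sketched in the same spirit: a judge LLM (e.g., GPT-3.5) compares a VLM's answer against the curated reference, and its verdicts are checked against human verdicts to obtain the agreement rate (reported at roughly 85% on average). The binary verdict format, the prompt, and the call_llm helper below are again illustrative assumptions rather than the paper's exact protocol.

```python
from typing import Callable, List

# Hypothetical judging prompt; the actual Auto-Bench judge prompts may differ.
JUDGE_PROMPT = """Question: {question}
Reference answer: {reference}
Reference reasoning: {reasoning}
Model answer: {prediction}

Does the model answer match the reference in meaning? Reply with 1 (yes) or 0 (no)."""


def judge_answer(sample: dict, prediction: str,
                 call_llm: Callable[[str], str]) -> int:
    """Return 1 if the judge LLM accepts the VLM's answer, else 0."""
    reply = call_llm(JUDGE_PROMPT.format(
        question=sample["question"],
        reference=sample["answer"],
        reasoning=sample["reasoning"],
        prediction=prediction,
    ))
    return 1 if reply.strip().startswith("1") else 0


def agreement_rate(llm_verdicts: List[int], human_verdicts: List[int]) -> float:
    """Fraction of samples on which the LLM judge and human annotators agree."""
    assert llm_verdicts and len(llm_verdicts) == len(human_verdicts)
    matches = sum(int(a == b) for a, b in zip(llm_verdicts, human_verdicts))
    return matches / len(llm_verdicts)
```

Note that this binary-match scoring only mirrors the quantitative side of the evaluation; the paper additionally reports qualitative automated assessments.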
Related papers
- AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information.
We introduce AutoBench-V, an automated framework for serving evaluation on demand.
Through an extensive evaluation of seven popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- PersoBench: Benchmarking Personalized Response Generation in Large Language Models [6.8046587254152735]
We present a new benchmark, PersoBench, to evaluate the personalization ability of large language models (LLMs) in persona-aware dialogue generation.
Our analysis, conducted on three well-known persona-aware datasets, evaluates multiple dimensions of response quality, including fluency, diversity, coherence, and personalization.
arXiv Detail & Related papers (2024-10-04T07:29:41Z)
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge.
Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills.
We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
- 3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset [13.808860456901204]
We introduce a scalable 3D benchmark, accompanied by a large-scale instruction-tuning dataset known as 3DBench.
Specifically, we establish the benchmark that spans a wide range of spatial and semantic scales, from object-level to scene-level.
We present a rigorous pipeline for automatically constructing scalable 3D instruction-tuning datasets, covering 10 diverse multi-modal tasks with more than 0.23 million QA pairs generated in total.
arXiv Detail & Related papers (2024-04-23T02:06:10Z)
- Assessment of Multimodal Large Language Models in Alignment with Human Values [43.023052912326314]
We introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations.
The Ch3Ef dataset contains 1,002 human-annotated data samples, covering 12 domains and 46 tasks based on the hhh (helpful, honest, harmless) principle.
arXiv Detail & Related papers (2024-03-26T16:10:21Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.