AGIBench: A Multi-granularity, Multimodal, Human-referenced,
Auto-scoring Benchmark for Large Language Models
- URL: http://arxiv.org/abs/2309.06495v1
- Date: Tue, 5 Sep 2023 13:43:37 GMT
- Title: AGIBench: A Multi-granularity, Multimodal, Human-referenced,
Auto-scoring Benchmark for Large Language Models
- Authors: Fei Tang, Wanling Gao, Luzhou Peng, Jianfeng Zhan
- Abstract summary: Evaluating the question-solving abilities of large language models like ChatGPT is a prominent yet challenging issue.
We propose AGIBench -- a multi-granularity, multimodal, human-referenced, and auto-scoring benchmarking methodology for LLMs.
- Score: 3.518832148294879
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) like ChatGPT have demonstrated remarkable
intelligence. Evaluating the question-solving abilities of LLMs and their
degrees of intelligence is a prominent yet challenging issue. First,
question-solving abilities are intertwined with different ability branches,
such as understanding, and with numerous knowledge categories, such as
mathematics. Second, question inputs are multimodal and may involve both text
and images. Third, the response formats of LLMs are diverse, which poses great
challenges for result extraction and
evaluation. In this paper, we propose AGIBench -- a multi-granularity,
multimodal, human-referenced, and auto-scoring benchmarking methodology for
LLMs. Instead of a collection of blended questions, AGIBench focuses on three
typical ability branches and adopts a four-tuple <ability branch, knowledge,
difficulty, modal> to label the attributes of each question. First, it supports
multi-granularity benchmarking, e.g., per-question, per-ability branch,
per-knowledge, per-modal, per-dataset, and per-difficulty level granularities.
Second, it contains multimodal input, including text and images. Third, it
classifies all the questions into five degrees of difficulty according to the
average accuracy rate of a large pool of educated humans (human-referenced). Fourth,
it adopts zero-shot learning to avoid introducing additional unpredictability
and provides an auto-scoring method to extract and judge the result. Finally,
it defines multi-dimensional metrics, including accuracy under the average,
worst, best, and majority voting cases, and repeatability. AGIBench is
publicly available at https://www.benchcouncil.org/agibench.
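To make the four-tuple labeling, the auto-scoring step, and the majority-voting metric concrete, the sketch below shows one plausible way to represent labeled questions, extract a choice label from a free-form response, and compute accuracy at different granularities. It is a minimal sketch under assumed names: the `Question` fields, `extract_choice`, and `majority_vote_accuracy` are illustrative and not the benchmark's actual schema or scoring code.

```python
# Minimal sketch, assuming a hypothetical schema and scoring helpers;
# this is NOT the official AGIBench implementation.
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Question:
    qid: str
    ability_branch: str   # e.g. "understanding"
    knowledge: str        # e.g. "mathematics"
    difficulty: int       # 1..5, derived from average human accuracy
    modal: str            # e.g. "text" or "text+image"
    answer: str           # gold choice label, e.g. "B"


def extract_choice(response):
    """Auto-scoring step (assumed heuristic): pull a choice label A-E from a free-form reply."""
    m = re.search(r"\b([A-E])\b", response)
    return m.group(1) if m else None


def majority_vote_accuracy(questions, runs):
    """runs: one dict per repeated evaluation, mapping qid -> raw LLM response."""
    correct = 0
    for q in questions:
        votes = Counter(extract_choice(run[q.qid]) for run in runs if q.qid in run)
        votes.pop(None, None)  # drop responses where no label could be extracted
        if votes and votes.most_common(1)[0][0] == q.answer:
            correct += 1
    return correct / len(questions)


def accuracy_by(questions, runs, attribute):
    """Multi-granularity view: group questions by an attribute such as 'difficulty' or 'modal'."""
    groups = {}
    for q in questions:
        groups.setdefault(getattr(q, attribute), []).append(q)
    return {value: majority_vote_accuracy(qs, runs) for value, qs in groups.items()}
```

Accuracy under the average, best, and worst cases would follow the same pattern by scoring each run separately and taking the mean, maximum, or minimum, while repeatability would measure how consistently the extracted answers agree across runs.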
Related papers
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [102.31558123570437]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs).
We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z) - TVBench: Redesigning Video-Language Evaluation [48.71203934876828]
We show that the currently most used video-language benchmarks can be solved without requiring much temporal reasoning.
We propose TVBench, a novel open-source video multiple-choice question-answering benchmark.
arXiv Detail & Related papers (2024-10-10T09:28:36Z) - From Data to Commonsense Reasoning: The Use of Large Language Models for Explainable AI [0.0]
We study the effectiveness of large language models (LLMs) on different question answering tasks.
We demonstrate the ability of LLMs to reason with commonsense as the models outperform humans on different datasets.
Our questionnaire revealed that 66% of participants rated GPT-3.5's explanations as either "good" or "excellent".
arXiv Detail & Related papers (2024-07-04T09:38:49Z) - LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge.
Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills.
We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z) - SceMQA: A Scientific College Entrance Level Multimodal Question
Answering Benchmark [42.91902601376494]
The paper introduces SceMQA, a novel benchmark for scientific multimodal question answering at the college entrance level.
SceMQA focuses on core science subjects including Mathematics, Physics, Chemistry, and Biology.
It features a blend of multiple-choice and free-response formats, ensuring a comprehensive evaluation of AI models' abilities.
arXiv Detail & Related papers (2024-02-06T19:16:55Z) - SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z) - M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining
Large Language Models [76.88692952308084]
M3Exam is a benchmark for evaluating large language models (LLMs) in a multilingual, multimodal, and multilevel context.
M3Exam contains 12,317 questions in 9 diverse languages with three educational levels.
We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text.
arXiv Detail & Related papers (2023-06-08T13:21:29Z) - Learn to Explain: Multimodal Reasoning via Thought Chains for Science
Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that SQA improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.