Benchmarking Foundation Models with Language-Model-as-an-Examiner
- URL: http://arxiv.org/abs/2306.04181v2
- Date: Sat, 4 Nov 2023 11:50:13 GMT
- Title: Benchmarking Foundation Models with Language-Model-as-an-Examiner
- Authors: Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang,
Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, Lei
Hou
- Abstract summary: We propose a novel benchmarking framework, Language-Model-as-an-Examiner.
The LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner.
- Score: 47.345760054595246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Numerous benchmarks have been established to assess the performance of
foundation models on open-ended question answering, which serves as a
comprehensive test of a model's ability to understand and generate language in
a manner similar to humans. Most of these works focus on proposing new
datasets; however, we see two main issues within previous benchmarking
pipelines, namely testing leakage and evaluation automation. In this paper, we
propose a novel benchmarking framework, Language-Model-as-an-Examiner, where
the LM serves as a knowledgeable examiner that formulates questions based on
its knowledge and evaluates responses in a reference-free manner. Our framework
allows for effortless extensibility as various LMs can be adopted as the
examiner, and the questions can be constantly updated given more diverse
trigger topics. For a more comprehensive and equitable evaluation, we devise
three strategies: (1) We instruct the LM examiner to generate questions across
a multitude of domains to probe for a broad acquisition, and raise follow-up
questions to engage in a more in-depth assessment. (2) Upon evaluation, the
examiner combines both scoring and ranking measurements, providing a reliable
result as it aligns closely with human annotations. (3) We additionally propose
a decentralized Peer-examination method to address the biases in a single
examiner. Our data and benchmarking results are available at:
http://lmexam.xlore.cn.
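To make the pipeline concrete, below is a minimal Python sketch of the three strategies. It is an illustration under stated assumptions, not the authors' released code: `Complete` stands in for any text-in/text-out LM API, and the prompts are loose paraphrases of the strategies above rather than the paper's actual templates.

```python
# Hypothetical sketch of the Language-Model-as-an-Examiner loop.
from statistics import mean
from typing import Callable, Dict

Complete = Callable[[str], str]  # assumed LM call: prompt in, text out

def generate_question(examiner: Complete, topic: str) -> str:
    """Strategy 1: the examiner poses a question from its own knowledge,
    so no fixed test set exists that could leak into training data."""
    return examiner(f"Ask one challenging open-ended question about: {topic}")

def follow_up(examiner: Complete, question: str, answer: str) -> str:
    """Strategy 1, depth: raise a follow-up grounded in the candidate's answer."""
    return examiner(
        f"Question: {question}\nAnswer: {answer}\n"
        "Ask one follow-up question that probes this answer more deeply."
    )

def score(examiner: Complete, question: str, answer: str) -> int:
    """Strategy 2a: reference-free scoring on a small integer scale.
    Assumes the examiner replies with a leading digit."""
    reply = examiner(
        f"Question: {question}\nAnswer: {answer}\n"
        "Rate the answer from 1 (poor) to 5 (excellent). Reply with one digit."
    )
    return int(reply.strip()[0])

def rank(examiner: Complete, question: str, a: str, b: str) -> str:
    """Strategy 2b: pairwise ranking, combined with scoring for reliability."""
    return examiner(
        f"Question: {question}\nAnswer A: {a}\nAnswer B: {b}\n"
        "Which answer is better? Reply 'A' or 'B'."
    ).strip()

def peer_examine(examiners: Dict[str, Complete], question: str, answer: str) -> float:
    """Strategy 3: average verdicts from several examiners so that no
    single model's biases dominate the result."""
    return mean(score(e, question, answer) for e in examiners.values())
```

In this framing, extensibility comes for free: swapping in a different examiner model or feeding new trigger topics extends the benchmark without any additional human annotation.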
Related papers
- CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA).
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)
- Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy [27.454549324141087]
We propose a novel VQA benchmark based on well-known visual classification datasets.
We also suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category.
Our contributions aim to lay the foundation for more precise and meaningful assessments.
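A toy sketch of that follow-up mechanism, assuming a hypothetical label hierarchy and question template (neither is taken from the paper):

```python
# Turn a classification label hierarchy into follow-up VQA questions
# about the ground-truth category. Toy data, illustrative only.
from typing import Dict, List

# Child -> parent edges (e.g. WordNet-style is-a relations).
PARENT: Dict[str, str] = {
    "golden retriever": "dog",
    "poodle": "dog",
    "dog": "animal",
    "cat": "animal",
}

def siblings(label: str) -> List[str]:
    """All labels sharing `label`'s parent: the candidate fine answers."""
    parent = PARENT.get(label)
    return [c for c, p in PARENT.items() if p == parent]

def follow_up_question(ground_truth: str) -> str:
    """Narrow a coarse answer down one level of the hierarchy."""
    parent = PARENT.get(ground_truth, ground_truth)
    options = ", ".join(siblings(ground_truth))
    return f"You said it is a {parent}. Which kind exactly: {options}?"

print(follow_up_question("golden retriever"))
# -> You said it is a dog. Which kind exactly: golden retriever, poodle?
```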
arXiv Detail & Related papers (2024-02-11T18:26:18Z)
- MERA: A Comprehensive LLM Evaluation in Russian [43.02318109348788]
We introduce an open Multimodal Evaluation of Russian-language Architectures (MERA) benchmark for evaluating foundation models.
The benchmark encompasses 21 evaluation tasks for generative models in 11 skill domains.
We propose an evaluation methodology, an open-source code base for the MERA assessment, and a leaderboard with a submission system.
arXiv Detail & Related papers (2024-01-09T12:55:21Z)
- LLMEval: A Preliminary Study on How to Evaluate Large Language Models [47.12588320134504]
We analyze evaluation methods by comparing various criteria with both manual and automatic evaluation, utilizing onsite, crowd-sourced, and public annotators as well as GPT-4.
A total of 2,186 individuals participated, leading to the generation of 243,337 manual annotations and 57,511 automatic evaluation results.
arXiv Detail & Related papers (2023-12-12T16:14:43Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, and author and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.