AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models
- URL: http://arxiv.org/abs/2407.08351v1
- Date: Thu, 11 Jul 2024 10:03:47 GMT
- Title: AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models
- Authors: Xiang Lisa Li, Evan Zheran Liu, Percy Liang, Tatsunori Hashimoto
- Abstract summary: We present three desiderata for a good benchmark for language models: salience, difficulty, and novelty, where a novel benchmark reveals new trends in model rankings not shown by previous benchmarks.
We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering.
- Score: 84.65095045762524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluation is critical for assessing capabilities, tracking scientific progress, and informing model selection. In this paper, we present three desiderata for a good benchmark for language models: (i) salience (e.g., knowledge about World War II is more salient than a random day in history), (ii) novelty (i.e., the benchmark reveals new trends in model rankings not shown by previous benchmarks), and (iii) difficulty (i.e., the benchmark should be difficult for existing models, leaving headroom for future improvement). We operationalize these three desiderata and cast benchmark creation as a search problem, that of finding benchmarks that satisfy all three desiderata. To tackle this search problem, we present AutoBencher, which uses a language model to automatically search for datasets that meet the three desiderata. AutoBencher uses privileged information (e.g., relevant documents) to construct reliable datasets, and adaptivity with reranking to optimize for the search objective. We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that are on average 27% more novel and 22% more difficult than existing benchmarks. A closer investigation of our constructed datasets shows that we can identify specific gaps in the knowledge of language models that are not captured by existing benchmarks, such as Gemini Pro performing much worse on question answering about the Permian Extinction and Fordism, while OpenAGI-7B performs surprisingly well on QA about COVID-19.
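The abstract frames benchmark creation as a search: a language model proposes candidate topics, privileged documents ground the question-answer pairs, and adaptive reranking optimizes a combined salience/novelty/difficulty objective. The minimal Python sketch below illustrates one way such a loop could look; it is not the authors' released implementation, every callable argument (propose_topics, build_qa, the salience and novelty scorers) is a hypothetical placeholder, and the equal weighting of the three terms is an assumption.
```python
# Hypothetical sketch of an AutoBencher-style benchmark search loop.
# The three scoring terms (salience, difficulty, novelty) follow the paper's
# desiderata; the function arguments are placeholders, not the authors' API.
from typing import Callable, Dict, List, Tuple
from statistics import mean

def search_benchmark(
    propose_topics: Callable[[int], List[str]],        # LM proposes candidate topics
    build_qa: Callable[[str], List[Tuple[str, str]]],  # (question, answer) pairs built from privileged docs
    candidate_models: Dict[str, Callable[[str], str]], # model name -> function that answers a question
    salience: Callable[[str], float],                  # e.g., topic popularity score in [0, 1]
    novelty: Callable[[Dict[str, float]], float],      # deviation of model accuracies from existing rankings
    rounds: int = 3,
    topics_per_round: int = 10,
) -> Tuple[str, List[Tuple[str, str]], float]:
    """Search for a (topic, dataset) pair maximizing salience + difficulty + novelty."""
    best = ("", [], float("-inf"))
    for _ in range(rounds):
        for topic in propose_topics(topics_per_round):
            qa_pairs = build_qa(topic)  # privileged documents ground the gold answers
            if not qa_pairs:
                continue
            # Difficulty: mean error rate of candidate models (exact-match scoring, a simplification).
            accuracies = {
                name: mean(model(q).strip() == a for q, a in qa_pairs)
                for name, model in candidate_models.items()
            }
            difficulty = 1.0 - mean(accuracies.values())
            # Combined objective; equal weights are an arbitrary assumption here.
            score = salience(topic) + difficulty + novelty(accuracies)
            if score > best[2]:
                best = (topic, qa_pairs, score)
        # An adaptive variant would feed the best topics so far back into propose_topics.
    return best
```
In practice one would plug in an LM-backed topic proposer and the candidate models to be ranked; the paper's adaptive reranking would additionally feed the highest-scoring topics from each round back into the proposer.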
Related papers
- TVBench: Redesigning Video-Language Evaluation [48.71203934876828]
We show that the currently most used video-language benchmarks can be solved without requiring much temporal reasoning.
We propose TVBench, a novel open-source video multiple-choice question-answering benchmark.
arXiv Detail & Related papers (2024-10-10T09:28:36Z)
- An Evaluation Framework for Attributed Information Retrieval using Large Language Models [5.216296688442701]
We propose a framework to evaluate and benchmark attributed information seeking.
Experiments using HAGRID, an attributed information-seeking dataset, show the impact of different scenarios on the correctness and attributability of answers.
arXiv Detail & Related papers (2024-09-12T12:57:08Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis [1.5761916307614148]
We propose the first benchmark of sentence embeddings for French.
We compare 51 carefully selected embedding models on a large scale.
We find that, while no model is best on all tasks, large multilingual models pre-trained on sentence similarity perform exceptionally well.
arXiv Detail & Related papers (2024-05-30T20:34:37Z)
- Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation [9.80836683456026]
We tackle data-to-text for isiXhosa, which is low-resource and agglutinative.
We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG.
We develop an evaluation framework for T2X that measures how accurately generated text describes the data.
arXiv Detail & Related papers (2024-03-12T11:53:27Z)
- Enhancing Retrieval Processes for Language Generation with Augmented Queries [0.0]
This research focuses on improving the factual accuracy of model responses through Retrieval-Augmented Generation (RAG), a technique that guides models to give accurate responses based on real facts.
To overcome scalability issues, the study explores connecting user queries with sophisticated language models such as BERT and Orca2.
The empirical results indicate a significant improvement in the initial language model's performance under RAG.
arXiv Detail & Related papers (2024-02-06T13:19:53Z)
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
GEMv2, the second release of the Generation, Evaluation, and Metrics (GEM) Benchmark, introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
- ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.