ERBench: An Entity-Relationship based Automatically Verifiable
Hallucination Benchmark for Large Language Models
- URL: http://arxiv.org/abs/2403.05266v1
- Date: Fri, 8 Mar 2024 12:42:36 GMT
- Title: ERBench: An Entity-Relationship based Automatically Verifiable
Hallucination Benchmark for Large Language Models
- Authors: Jio Oh, Soyeon Kim, Junseok Seo, Jindong Wang, Ruochen Xu, Xing Xie,
Steven Euijong Whang
- Abstract summary: Large language models (LLMs) have achieved unprecedented performance in various applications, yet their evaluation remains a critical issue.
We contend that utilizing existing relational databases is a promising approach for constructing benchmarks due to their accurate knowledge description.
We propose ERBench to automatically convert any relational database into a benchmark based on the entity-relationship (ER) model.
- Score: 48.38966595131693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have achieved unprecedented performance in
various applications, yet their evaluation remains a critical issue. Existing
hallucination benchmarks are either static or lack adjustable complexity for
thorough analysis. We contend that utilizing existing relational databases is a
promising approach for constructing benchmarks due to their accurate knowledge
description via functional dependencies. We propose ERBench to automatically
convert any relational database into a benchmark based on the
entity-relationship (ER) model. Our key idea is to construct questions using
the database schema, records, and functional dependencies such that they can be
automatically verified. In addition, we use foreign key constraints to join
relations and construct multi-hop questions, which can be arbitrarily complex
and used to debug the intermediate answers of LLMs. Finally, ERBench supports
continuous evaluation, multimodal questions, and various prompt engineering
techniques. In our experiments, we construct an LLM benchmark using databases
of multiple domains and make an extensive comparison of contemporary LLMs. We
observe that better LLMs like GPT-4 can handle a larger variety of question
types, but are by no means perfect. Also, correct answers do not necessarily
imply correct rationales, which is an important evaluation that ERBench does
better than other benchmarks for various question types. Code is available at
https://github.com/DILAB-KAIST/ERBench.
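To make the abstract's construction concrete, the following is a minimal sketch in Python. It is a hypothetical illustration rather than the official ERBench implementation: the movie and director tables, the functional dependency, the foreign key, and the question templates are all assumed for the example.

# Hypothetical sketch of ERBench-style question construction.
# Movie relation: the functional dependency title -> director holds, so each
# record yields a question with exactly one database-verifiable answer.
MOVIES = {
    "Inception": {"director": "Christopher Nolan", "year": 2010},
    "Parasite": {"director": "Bong Joon-ho", "year": 2019},
}

# Director relation, reachable from MOVIES through a foreign key on the
# director name; joining the two relations yields multi-hop questions.
DIRECTORS = {
    "Christopher Nolan": {"birth_year": 1970},
    "Bong Joon-ho": {"birth_year": 1969},
}

def single_hop_question(title):
    # The FD title -> director means the record itself certifies the answer.
    return f"Who directed the movie {title}?", MOVIES[title]["director"]

def multi_hop_question(title):
    # Hop 1 uses the FD; hop 2 follows the foreign key into DIRECTORS.
    # The intermediate answer (the director) can be checked on its own,
    # so rationales, not just final answers, can be verified.
    director = MOVIES[title]["director"]
    birth_year = DIRECTORS[director]["birth_year"]
    return (f"In which year was the director of {title} born?",
            director, str(birth_year))

def verify(model_response, gold_answer):
    # Simplified automatic verification: check the model's response against
    # the database-derived answer (the paper additionally inspects the
    # intermediate answers of multi-hop questions).
    return gold_answer.lower() in model_response.lower()

if __name__ == "__main__":
    print(*single_hop_question("Inception"), sep=" -> ")
    q, hop, a = multi_hop_question("Parasite")
    print(f"{q} -> {a} (intermediate: {hop})")

Because every hop is grounded in a database record, longer foreign-key chains make the questions arbitrarily more complex while keeping each intermediate answer, and hence the model's rationale, automatically checkable.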
Related papers
- Relational Database Augmented Large Language Model [59.38841050766026]
Large language models (LLMs) excel in many natural language processing (NLP) tasks.
However, they can only incorporate new knowledge through training or supervised fine-tuning.
Meanwhile, precise, up-to-date, and private information is typically stored in relational databases.
arXiv Detail & Related papers (2024-07-21T06:19:10Z)
- Lucy: Think and Reason to Solve Text-to-SQL [12.52968634440807]
Large Language Models (LLMs) have made significant progress in assisting users to query databases in natural language.
LLMs provide state-of-the-art results on many standard benchmarks, but their performance significantly drops when applied to large enterprise databases.
We propose a new solution that combines the power of LLMs in understanding questions with automated reasoning techniques to handle complex database constraints.
arXiv Detail & Related papers (2024-07-06T18:56:42Z)
- UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics.
We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z)
- MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark that demands meta-reasoning skills.
MR-BEN is a comprehensive benchmark comprising 5,975 questions collected from human experts.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STaRK, a large-scale semi-structured retrieval benchmark on textual and relational knowledge bases.
Our benchmark covers three domains/datasets: product search, academic paper search, and queries in precision medicine.
We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z)
- FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models [37.34801677290571]
FanOutQA is a high-quality dataset of fan-out question-answer pairs and human-annotated decompositions, with English Wikipedia as its knowledge base.
We formulate three benchmark settings across our dataset and benchmark 7 LLMs, including GPT-4, LLaMA 2, Claude-2.1, and Mixtral-8x7B.
arXiv Detail & Related papers (2024-02-21T20:30:45Z)
- InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
- Towards Realistic Optimization Benchmarks: A Questionnaire on the Properties of Real-World Problems [2.805617945875364]
This work aims to identify properties of real-world problems through a questionnaire.
The responses already reveal several challenges that must be considered when designing realistic benchmarks.
A key point for future work is to gather more responses to the questionnaire.
arXiv Detail & Related papers (2020-04-14T10:04:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.