NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear
Domain
- URL: http://arxiv.org/abs/2310.10920v1
- Date: Tue, 17 Oct 2023 01:27:20 GMT
- Title: NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear
Domain
- Authors: Anurag Acharya, Sai Munikoti, Aaron Hellinger, Sara Smith, Sridevi
Wagle, and Sameera Horawalavithana
- Abstract summary: NuclearQA is a human-made benchmark of 100 questions to evaluate language models in the nuclear domain.
We show how the mix of several types of questions makes our benchmark uniquely capable of evaluating models in the nuclear domain.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As LLMs have become increasingly popular, they have been used in
almost every field. But as their applications expand from generic fields to
narrow, focused science domains, there is an ever-increasing gap in ways to
evaluate their efficacy in those fields. Many of the benchmarks that do exist
focus on questions that do not require a proper understanding of the subject
in question. In this paper, we present NuclearQA, a human-made benchmark of
100 questions to evaluate language models in the nuclear domain, consisting of
a varied collection of questions specifically designed by experts to test the
abilities of language models. We detail our approach and show how the mix of
several types of questions makes our benchmark uniquely capable of evaluating
models in the nuclear domain. Because of the limitations of existing metrics,
we also present our own evaluation metric for assessing LLM performance. Our
experiments on state-of-the-art models suggest that even the best LLMs perform
less than satisfactorily on our benchmark, demonstrating the scientific
knowledge gap of existing LLMs.
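A minimal sketch of what an evaluation loop over a question-answer benchmark such as NuclearQA could look like is given below. The JSON layout, the field names ("question", "reference_answer"), the generate() callable, and the token-overlap scoring are illustrative assumptions only; the paper defines its own evaluation metric and does not prescribe this data format.

import json
from typing import Callable, Dict, List


def load_benchmark(path: str) -> List[Dict[str, str]]:
    # Hypothetical format: a JSON list of {"question": ..., "reference_answer": ...} records.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def token_overlap(prediction: str, reference: str) -> float:
    # Stand-in metric: fraction of reference tokens that also appear in the prediction.
    ref_tokens = set(reference.lower().split())
    pred_tokens = set(prediction.lower().split())
    return len(ref_tokens & pred_tokens) / max(len(ref_tokens), 1)


def evaluate(generate: Callable[[str], str], benchmark: List[Dict[str, str]]) -> float:
    # Query the model on every question and average the per-question scores.
    scores = [token_overlap(generate(item["question"]), item["reference_answer"])
              for item in benchmark]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy model and a single toy question, just to exercise the loop end to end.
    dummy_model = lambda prompt: "Uranium-235 is a fissile isotope."
    bench = [{"question": "Name a fissile isotope of uranium.",
              "reference_answer": "Uranium-235"}]
    print(f"Mean score: {evaluate(dummy_model, bench):.2f}")

A real harness would replace the dummy model with an actual LLM client and the token-overlap stand-in with the scoring scheme the paper proposes.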
Related papers
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- Evaluating the Performance of Large Language Models via Debates [43.40134389150456]
We propose an automated benchmarking framework based on debates between Large Language Models (LLMs).
This method assesses not only domain knowledge, but also skills such as problem definition and inconsistency recognition.
We evaluate the performance of various state-of-the-art LLMs using the debate framework and achieve rankings that align closely with popular rankings based on human input.
arXiv Detail & Related papers (2024-06-16T19:02:31Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect corresponding ability, while new tasks can also be easily added into the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)
- CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care [14.326936563564171]
We present a benchmark, CARE-MI, for evaluating misinformation in large language models (LLMs).
Our proposed benchmark fills the gap between the extensive usage of LLMs and the lack of datasets for assessing the misinformation generated by these models.
Using our benchmark, we conduct extensive experiments and find that current Chinese LLMs are far from perfect on the topic of maternity and infant care.
arXiv Detail & Related papers (2023-07-04T03:34:19Z)
- Benchmarking Foundation Models with Language-Model-as-an-Examiner [47.345760054595246]
We propose a novel benchmarking framework, Language-Model-as-an-Examiner.
The LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner.
arXiv Detail & Related papers (2023-06-07T06:29:58Z)