SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
- URL: http://arxiv.org/abs/2406.10421v3
- Date: Wed, 02 Oct 2024 14:20:50 GMT
- Title: SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
- Authors: Tu Anh Dinh, Carlos Mullov, Leonard Bärmann, Zhaolin Li, Danni Liu, Simon Reiß, Jueun Lee, Nathan Lerzer, Fabian Ternava, Jianfeng Gao, Tobias Röddiger, Alexander Waibel, Tamim Asfour, Michael Beigl, Rainer Stiefelhagen, Carsten Dachsbacher, Klemens Böhm, Jan Niehues,
- Abstract summary: One common use of Large Language Models (LLMs) is performing tasks on scientific topics.
Inspired by the way university students are evaluated on such tasks, we propose SciEx - a benchmark consisting of university computer science exam questions.
We evaluate the performance of various state-of-the-art LLMs on our new benchmark.
- Score: 100.02175403852253
- License:
- Abstract: With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks which can evaluate the ability of LLMs on different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper, we propose SciEx - a benchmark consisting of university computer science exam questions, to evaluate LLMs ability on solving scientific tasks. SciEx is (1) multilingual, containing both English and German exams, and (2) multi-modal, containing questions that involve images, and (3) contains various types of freeform questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are freeform, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for the current LLMs, where the best LLM only achieves 59.4\% exam grade on average. We also provide detailed comparisons between LLM performance and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers on SciEx. Our experiments show that, although they do not perform perfectly on solving the exams, LLMs are decent as graders, achieving 0.948 Pearson correlation with expert grading.
Related papers
- CLR-Bench: Evaluating Large Language Models in College-level Reasoning [17.081788240112417]
Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks.
We present CLR-Bench to comprehensively evaluate the LLMs in complex college-level reasoning.
arXiv Detail & Related papers (2024-10-23T04:55:08Z) - CIBench: Evaluating Your LLMs with a Code Interpreter Plugin [68.95137938214862]
We propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks.
The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions.
We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
arXiv Detail & Related papers (2024-07-15T07:43:55Z) - ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline [42.61538071832468]
Large language models (LLMs) have shown excellent mastering of human language, but still struggle in real-world applications that require mathematical problem-solving.
We tailor the Self-Critique pipeline, which addresses the challenge in the feedback learning stage of LLM alignment.
arXiv Detail & Related papers (2024-04-03T17:51:18Z) - Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange [25.419977967846144]
Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks.
This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving.
arXiv Detail & Related papers (2024-03-30T12:48:31Z) - GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z) - When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models [59.84769254832941]
We propose a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp.
Specifically, the cunning texts that FLUB focuses on mainly consist of the tricky, humorous, and misleading texts collected from the real internet environment.
Based on FLUB, we investigate the performance of multiple representative and advanced LLMs.
arXiv Detail & Related papers (2024-02-16T22:12:53Z) - Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z) - "Which LLM should I use?": Evaluating LLMs for tasks performed by Undergraduate Computer Science Students [2.6043678412433713]
This study evaluates the effectiveness of large language models (LLMs) in performing tasks common among undergraduate computer science students.
Our research systematically assesses some of the publicly available LLMs such as Google Bard, ChatGPT(3.5), GitHub Copilot Chat, and Microsoft Copilot Chat.
arXiv Detail & Related papers (2024-01-22T15:11:36Z) - Through the Lens of Core Competency: Survey on Evaluation of Large
Language Models [27.271533306818732]
Large language model (LLM) has excellent performance and wide practical uses.
Existing evaluation tasks are difficult to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect corresponding ability, while new tasks can also be easily added into the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z) - Can Large Language Models Transform Computational Social Science? [79.62471267510963]
Large Language Models (LLMs) are capable of performing many language processing tasks zero-shot (without training data)
This work provides a road map for using LLMs as Computational Social Science tools.
arXiv Detail & Related papers (2023-04-12T17:33:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.