VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models
- URL: http://arxiv.org/abs/2305.12199v1
- Date: Sat, 20 May 2023 14:13:08 GMT
- Title: VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models
- Authors: Dao Xuan-Quy and Le Ngoc-Bich and Vo The-Duy and Phan Xuan-Dung and Ngo Bac-Bien and Nguyen Van-Tien and Nguyen Thi-My-Thanh and Nguyen Hong-Phuoc
- Abstract summary: This article introduces the VNHSGE dataset, developed exclusively for evaluating large language models (LLMs).
The dataset covers nine subjects and was generated from the Vietnamese National High School Graduation Examination and comparable tests.
It includes 300 literary essays and over 19,000 multiple-choice questions on a range of topics.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This article introduces the VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs). The dataset covers nine subjects and was generated from the Vietnamese National High School Graduation Examination and comparable tests. It includes 300 literary essays and over 19,000 multiple-choice questions on a range of topics. By providing both textual data and accompanying images, the dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more. We evaluated ChatGPT and BingChat on the VNHSGE dataset and compared their performance with that of Vietnamese students. The results show that both ChatGPT and BingChat perform at a human level in a number of subjects, including literature, English, history, geography, and civic education. However, they still have room to improve, particularly in mathematics, physics, chemistry, and biology. With its wide-ranging coverage and variety of tasks, the VNHSGE dataset seeks to provide an adequate benchmark for assessing the abilities of LLMs. By making this dataset available to the scientific community, we intend to promote future developments in LLMs, especially in addressing their limitations in mathematics and the natural sciences.
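As a usage illustration, the following minimal Python sketch shows how one might score an LLM on a multiple-choice split of a dataset such as VNHSGE. The JSON field names (subject, question, choices, answer), the file name, and the ask_model placeholder are assumptions made for illustration only; the actual VNHSGE file format and the API used to query ChatGPT or BingChat are not specified here.

    import json
    from collections import defaultdict

    # NOTE: a minimal sketch, not the authors' evaluation pipeline.
    def ask_model(question, choices):
        """Placeholder for an LLM call (e.g., ChatGPT or BingChat).
        Expected to return one of the choice labels: "A", "B", "C", or "D".
        Returns "A" here so the script runs without any API access."""
        return "A"

    def evaluate(path):
        """Compute per-subject accuracy for a multiple-choice JSON file.
        Assumed (hypothetical) record layout:
          {"subject": "Physics", "question": "...",
           "choices": {"A": "...", "B": "...", "C": "...", "D": "..."},
           "answer": "C"}"""
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        correct, total = defaultdict(int), defaultdict(int)
        for rec in records:
            prediction = ask_model(rec["question"], rec["choices"])
            total[rec["subject"]] += 1
            if prediction == rec["answer"]:
                correct[rec["subject"]] += 1
        return {s: correct[s] / total[s] for s in total}

    if __name__ == "__main__":
        scores = evaluate("vnhsge_mcq.json")  # hypothetical file name
        for subject, accuracy in sorted(scores.items()):
            print(f"{subject:12s} accuracy = {accuracy:.2%}")

Swapping ask_model for a real model call yields per-subject accuracies that can be compared with the student-level results reported above.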
Related papers
- WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines [74.25764182510295]
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English.
We introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding.
This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points.
arXiv Detail & Related papers (2024-10-16T16:11:49Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 works of literary fiction that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of literary fiction (e.g., novel type, number of characters, year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z)
- Cross-Data Knowledge Graph Construction for LLM-enabled Educational Question-Answering System: A Case Study at HCMUT [2.8000537365271367]
Large language models (LLMs) have emerged as a vibrant research topic.
LLMs face challenges in remembering events, incorporating new information, and addressing domain-specific issues or hallucinations.
This article proposes a method for automatically constructing a Knowledge Graph from multiple data sources.
arXiv Detail & Related papers (2024-04-14T16:34:31Z)
- EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models [29.31649801849329]
EXAMS-V is a new challenging multi-discipline multimodal multilingual exam benchmark for evaluating vision language models.
It consists of 20,932 multiple-choice questions across 20 school disciplines covering natural science, social science, and other miscellaneous studies.
The questions come in 11 languages from 7 language families.
arXiv Detail & Related papers (2024-03-15T15:08:39Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation abilities by training billions of parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
- Can LLMs Grade Short-Answer Reading Comprehension Questions: An Empirical Study with a Novel Dataset [0.0]
This paper investigates the potential for the newest version of Large Language Models (LLMs) to be used in short answer questions for formative assessments.
It introduces a novel dataset of short answer reading comprehension questions, drawn from a set of reading assessments conducted with over 150 students in Ghana.
The paper empirically evaluates how well various configurations of generative LLMs grade student short answer responses compared to expert human raters.
arXiv Detail & Related papers (2023-10-26T17:05:40Z)
- Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education [0.16317061277457]
We evaluate the ability of large language models (LLMs) to perform multiple-choice symbol binding (MCSB) for multiple-choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings.
This dataset can be used to evaluate the MCSB ability of LLMs and smaller language models (LMs) because it is written in a strict, consistent style.
arXiv Detail & Related papers (2023-10-18T15:48:07Z)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
- CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? [15.53530547827583]
We present the Chinese Elementary School Math Word Problems dataset, comprising 1.7k elementary school-level math word problems with detailed annotations.
This dataset aims to provide a benchmark tool for assessing the abilities of popular large language models (LLMs).
We evaluate a variety of popular LLMs, including both commercial and open-source options, and discover that only GPT-4 achieves success (accuracy $\geq$ 60%) across all six elementary school grades.
arXiv Detail & Related papers (2023-06-29T02:19:50Z)