Evaluating the Symbol Binding Ability of Large Language Models for
Multiple-Choice Questions in Vietnamese General Education
- URL: http://arxiv.org/abs/2310.12059v3
- Date: Thu, 16 Nov 2023 14:04:15 GMT
- Authors: Duc-Vu Nguyen, Quoc-Nam Nguyen
- Abstract summary: We evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) for multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings.
This dataset can be used to evaluate the MCSB ability of LLMs and smaller language models (LMs) because it is typed in a strict style.
- Score: 0.16317061277457
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we evaluate the ability of large language models (LLMs) to
perform multiple choice symbol binding (MCSB) for multiple choice question
answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus
on Vietnamese, which has fewer challenging MCQA datasets than English. The two
existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent
research in Vietnamese natural language processing (NLP) has focused on the
Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to
2023 to evaluate ChatGPT. However, these studies have mainly focused on how
ChatGPT solves the VNHSGE step by step. We aim to create a novel and
high-quality dataset by providing structured guidelines for typing LaTeX
formulas for mathematics, physics, chemistry, and biology. This dataset can be
used to evaluate the MCSB ability of LLMs and smaller language models (LMs)
because it is typed in a strict LaTeX style. We focus on predicting the
character (A, B, C, or D) that is the most likely answer to a question, given
the context of the question. Our evaluation of six well-known LLMs, namely
BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the
ViMMRC 1.0 and ViMMRC 2.0 benchmarks and our proposed dataset shows promising
results on the MCSB ability of LLMs for Vietnamese. The dataset is available
for research purposes only.
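
The prediction step described in the abstract (choosing the character A, B, C, or D that is most likely given the question) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt layout, the function names `format_item`, `build_prompt`, and `predict_symbol`, and the log-probability values are all assumptions made for illustration. In practice the per-symbol log-probabilities would come from a language model's next-token distribution.

```python
import math

SYMBOLS = ("A", "B", "C", "D")


def format_item(question, options):
    """Render one question with lettered options (hypothetical layout)."""
    lines = [question]
    lines += [f"{symbol}. {option}" for symbol, option in zip(SYMBOLS, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def build_prompt(question, options, shots=()):
    """Prepend k solved examples: k=0 is zero-shot, k=1 one-shot, k>1 few-shot."""
    blocks = [
        format_item(shot_question, shot_options) + " " + shot_answer
        for shot_question, shot_options, shot_answer in shots
    ]
    blocks.append(format_item(question, options))
    return "\n\n".join(blocks)


def predict_symbol(symbol_logprobs):
    """Pick the most likely symbol and renormalize over A-D only.

    symbol_logprobs maps each symbol to an assumed log-probability of that
    symbol being the next token after "Answer:".
    """
    total = sum(math.exp(symbol_logprobs[s]) for s in SYMBOLS)
    probs = {s: math.exp(symbol_logprobs[s]) / total for s in SYMBOLS}
    best = max(SYMBOLS, key=lambda s: probs[s])
    return best, probs


# Made-up scores for illustration; a real run would query a model.
example_logprobs = {"A": -2.0, "B": -0.5, "C": -3.0, "D": -1.5}
best_symbol, symbol_probs = predict_symbol(example_logprobs)
prompt = build_prompt(
    "2 + 2 = ?",
    ["3", "4", "5", "6"],
    shots=[("1 + 1 = ?", ["1", "2", "3", "4"], "B")],
)
```

Renormalizing over the four symbols only is a design choice in this sketch: it ignores probability mass the model places on other tokens, so the comparison reflects symbol binding rather than output formatting.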
Related papers
- Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On [55.449818944278526]
We introduce the Skywork-Math model series, supervised fine-tuned (SFT) on common 7B language models.
Skywork-Math 7B achieves an accuracy of 51.2% on the competition-level MATH benchmark.
We provide several practical takeaways to enhance math reasoning abilities in LLMs for both research and industry applications.
arXiv Detail & Related papers (2024-07-11T09:56:51Z)
- Question Translation Training for Better Multilingual Reasoning [108.10066378240879]
Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English.
A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translate-training.
In this paper we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data.
arXiv Detail & Related papers (2024-01-15T16:39:10Z)
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- On Bilingual Lexicon Induction with Large Language Models [81.6546357879259]
We examine the potential of the latest generation of Large Language Models for the development of bilingual lexicons.
We study 1) zero-shot prompting for unsupervised BLI and 2) few-shot in-context prompting with a set of seed translation pairs.
Our work is the first to demonstrate strong BLI capabilities of text-to-text mLLMs.
arXiv Detail & Related papers (2023-10-21T12:43:27Z)
- CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? [15.53530547827583]
We present the Chinese Elementary School Math Word Problems dataset, comprising 1.7k elementary school-level math word problems with detailed annotations.
This dataset aims to provide a benchmark tool for assessing the abilities of popular large language models (LLMs).
We evaluate a variety of popular LLMs, including both commercial and open-source options, and discover that only GPT-4 achieves success (accuracy $\geq$ 60%) across all six elementary school grades.
arXiv Detail & Related papers (2023-06-29T02:19:50Z)
- VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models [0.0]
This article introduces the VNHSGE dataset, developed exclusively for evaluating large language models (LLMs).
The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests.
300 literary essays have been included, and there are over 19,000 multiple-choice questions on a range of topics.
arXiv Detail & Related papers (2023-05-20T14:13:08Z)
- A Multiple Choices Reading Comprehension Corpus for Vietnamese Language Education [2.5199066832791535]
ViMMRC 2.0 is an extension of the previous ViMMRC for the task of multiple-choice reading comprehension in Vietnamese Textbooks.
This dataset has 699 reading passages which are prose and poems, and 5,273 questions.
Our multi-stage models achieved an accuracy of 58.81% on the test set, which is 5.34% better than the highest BERTology models.
arXiv Detail & Related papers (2023-03-31T15:54:54Z)
- VLSP 2021 Shared Task: Vietnamese Machine Reading Comprehension [2.348805691644086]
This article presents details of the organization of the shared task, an overview of the methods employed by shared-task participants, and the results.
We provide a benchmark dataset named UIT-ViQuAD 2.0 for evaluating the MRC task and question answering systems for the Vietnamese language.
The UIT-ViQuAD 2.0 dataset motivates more researchers to explore Vietnamese machine reading comprehension, question answering, and question generation.
arXiv Detail & Related papers (2022-03-22T00:44:41Z)
- Language Models as Few-Shot Learner for Task-Oriented Dialogue Systems [74.8759568242933]
Task-oriented dialogue systems use four connected modules, namely, Natural Language Understanding (NLU), Dialogue State Tracking (DST), Dialogue Policy (DP) and Natural Language Generation (NLG).
A research challenge is to learn each module with the least amount of samples given the high cost related to the data collection.
We evaluate the priming few-shot ability of language models in the NLU, DP and NLG tasks.
arXiv Detail & Related papers (2020-08-14T08:23:21Z)
- A Sentence Cloze Dataset for Chinese Machine Reading Comprehension [64.07894249743767]
We propose a new task called Sentence Cloze-style Machine Reading (SC-MRC).
The proposed task aims to fill the right candidate sentence into the passage that has several blanks.
We built a Chinese dataset called CMRC 2019 to evaluate the difficulty of the SC-MRC task.
arXiv Detail & Related papers (2020-04-07T04:09:00Z)
- Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension [2.5199066832791535]
We construct a dataset which consists of 2,783 pairs of multiple-choice questions and answers based on 417 Vietnamese texts.
We propose a lexical-based MRC method that utilizes semantic similarity measures and external knowledge sources to analyze questions and extract answers from the given text.
Our proposed method achieves an accuracy of 61.81%, which is 5.51% higher than the best baseline model.
arXiv Detail & Related papers (2020-01-16T08:09:51Z)