Related papers: Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education

URL: http://arxiv.org/abs/2310.12059v3
Date: Thu, 16 Nov 2023 14:04:15 GMT
Title: Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education
Authors: Duc-Vu Nguyen, Quoc-Nam Nguyen
Abstract summary: We evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) for multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. This dataset can be used to evaluate the MCSB ability of LLMs and smaller language models (LMs) because it is typed in a strict style.
Score: 0.16317061277457
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In this paper, we evaluate the ability of large language models (LLMs) to perform multiple choice symbol binding (MCSB) for multiple choice question answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus on Vietnamese, with fewer challenging MCQA datasets than in English. The two existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent research in Vietnamese natural language processing (NLP) has focused on the Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to 2023 to evaluate ChatGPT. However, these studies have mainly focused on how ChatGPT solves the VNHSGE step by step. We aim to create a novel and high-quality dataset by providing structured guidelines for typing LaTeX formulas for mathematics, physics, chemistry, and biology. This dataset can be used to evaluate the MCSB ability of LLMs and smaller language models (LMs) because it is typed in a strict LaTeX style. We focus on predicting the character (A, B, C, or D) that is the most likely answer to a question, given the context of the question. Our evaluation of six well-known LLMs, namely BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the ViMMRC 1.0 and ViMMRC 2.0 benchmarks and our proposed dataset shows promising results on the MCSB ability of LLMs for Vietnamese. The dataset is available for research purposes only.
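The symbol-prediction step described in the abstract (choosing the character A, B, C, or D that is most likely given the question) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt layout, the function names `build_prompt`, `pick_symbol`, and `accuracy`, and the assumption that a model exposes a log-probability for each candidate symbol are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List

# The four answer symbols named in the abstract.
SYMBOLS = ("A", "B", "C", "D")


def build_prompt(question: str, options: Dict[str, str]) -> str:
    """Format a question and its labeled options (hypothetical layout).

    The prompt ends with "Answer:" so that the next token a model emits
    would be one of the answer symbols.
    """
    lines = [question]
    lines += [f"{symbol}. {options[symbol]}" for symbol in SYMBOLS]
    lines.append("Answer:")
    return "\n".join(lines)


def pick_symbol(symbol_logprobs: Dict[str, float]) -> str:
    """Return the symbol with the highest log-probability.

    `symbol_logprobs` is assumed to come from some model; symbols that
    are missing from it are treated as having probability zero.
    """
    return max(SYMBOLS, key=lambda s: symbol_logprobs.get(s, -math.inf))


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predicted symbols that match the gold symbols."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

In use, one would call `build_prompt` for each question, obtain per-symbol log-probabilities from the model under evaluation, pass them to `pick_symbol`, and score the resulting list with `accuracy`. Zero-shot, one-shot, and few-shot settings would differ only in how many solved examples are prepended to the prompt.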

Related papers

An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
Large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform many-to-many summarization (M2MS) in real applications. This work presents a systematic empirical study on LLMs' M2MS ability.
arXiv Detail & Related papers (2025-05-19T11:18:54Z)
Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long-form content. Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning. Our experiments on two datasets from different domains demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z)
MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs [61.74749961334557]
MathHay is an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing models.
arXiv Detail & Related papers (2024-10-07T02:30:07Z)
Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs. By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
On Bilingual Lexicon Induction with Large Language Models [81.6546357879259]
We examine the potential of the latest generation of Large Language Models for the development of bilingual lexicons. We study 1) zero-shot prompting for unsupervised BLI and 2) few-shot in-context prompting with a set of seed translation pairs. Our work is the first to demonstrate strong BLI capabilities of text-to-text mLLMs.
arXiv Detail & Related papers (2023-10-21T12:43:27Z)
CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? [15.53530547827583]
We present the Chinese Elementary School Math Word Problems dataset, comprising 1.7k elementary school-level math word problems with detailed annotations. This dataset aims to provide a benchmark tool for assessing the abilities of popular large language models (LLMs). We evaluate a variety of popular LLMs, including both commercial and open-source options, and discover that only GPT-4 achieves success (accuracy $\geq$ 60%) across all six elementary school grades.
arXiv Detail & Related papers (2023-06-29T02:19:50Z)
VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models [0.0]
This article introduces the VNHSGE dataset, developed exclusively for evaluating large language models (LLMs). The dataset covers nine subjects and was generated from the Vietnamese National High School Graduation Examination and comparable tests. It includes 300 literary essays and over 19,000 multiple-choice questions on a range of topics.
arXiv Detail & Related papers (2023-05-20T14:13:08Z)
A Multiple Choices Reading Comprehension Corpus for Vietnamese Language Education [2.5199066832791535]
ViMMRC 2.0 is an extension of the previous ViMMRC for the task of multiple-choice reading comprehension in Vietnamese textbooks. This dataset has 699 reading passages, which are prose and poems, and 5,273 questions. Our multi-stage models achieved an accuracy of 58.81% on the test set, which is 5.34% higher than the best BERTology models.
arXiv Detail & Related papers (2023-03-31T15:54:54Z)
VLSP 2021 Shared Task: Vietnamese Machine Reading Comprehension [2.348805691644086]
This article presents details of the organization of the shared task, an overview of the methods employed by shared-task participants, and the results. We provide a benchmark dataset named UIT-ViQuAD 2.0 for evaluating the MRC task and question answering systems for the Vietnamese language. The UIT-ViQuAD 2.0 dataset motivates more researchers to explore Vietnamese machine reading comprehension, question answering, and question generation.
arXiv Detail & Related papers (2022-03-22T00:44:41Z)
Language Models as Few-Shot Learner for Task-Oriented Dialogue Systems [74.8759568242933]
Task-oriented dialogue systems use four connected modules, namely, Natural Language Understanding (NLU), Dialogue State Tracking (DST), Dialogue Policy (DP) and Natural Language Generation (NLG). A research challenge is to learn each module with the fewest samples, given the high cost of data collection. We evaluate the priming few-shot ability of language models in the NLU, DP and NLG tasks.
arXiv Detail & Related papers (2020-08-14T08:23:21Z)
A Sentence Cloze Dataset for Chinese Machine Reading Comprehension [64.07894249743767]
We propose a new task called Sentence Cloze-style Machine Reading Comprehension (SC-MRC). The proposed task aims to fill the correct candidate sentences into a passage that has several blanks. We built a Chinese dataset called CMRC 2019 to evaluate the difficulty of the SC-MRC task.
arXiv Detail & Related papers (2020-04-07T04:09:00Z)
Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension [2.5199066832791535]
We construct a dataset consisting of 2,783 pairs of multiple-choice questions and answers based on 417 Vietnamese texts. We propose a lexical-based MRC method that utilizes semantic similarity measures and external knowledge sources to analyze questions and extract answers from the given text. Our proposed method achieves an accuracy of 61.81%, which is 5.51% higher than the best baseline model.
arXiv Detail & Related papers (2020-01-16T08:09:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.