Related papers: ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models

ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models

URL: http://arxiv.org/abs/2404.11086v2
Date: Thu, 18 Apr 2024 07:41:23 GMT
Title: ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models
Authors: Trong-Hieu Nguyen, Anh-Cuong Le, Viet-Cuong Nguyen,
Abstract summary: ViLLM-Eval is a comprehensive evaluation suite designed to measure the advanced knowledge and reasoning abilities of foundation models. A thorough evaluation of the most advanced LLMs on ViLLM-Eval revealed that even the best performing models have significant room for improvement.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The rapid advancement of large language models (LLMs) necessitates the development of new benchmarks to accurately assess their capabilities. To address this need for Vietnamese, this work aims to introduce ViLLM-Eval, the comprehensive evaluation suite designed to measure the advanced knowledge and reasoning abilities of foundation models within a Vietnamese context. ViLLM-Eval consists of multiple-choice questions and predict next word tasks spanning various difficulty levels and diverse disciplines, ranging from humanities to science and engineering. A thorough evaluation of the most advanced LLMs on ViLLM-Eval revealed that even the best performing models have significant room for improvement in understanding and responding to Vietnamese language tasks. ViLLM-Eval is believed to be instrumental in identifying key strengths and weaknesses of foundation models, ultimately promoting their development and enhancing their performance for Vietnamese users. This paper provides a thorough overview of ViLLM-Eval as part of the Vietnamese Large Language Model shared task, held within the 10th International Workshop on Vietnamese Language and Speech Processing (VLSP 2023).

Related papers

MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [60.52580061637301]
MMLU-ProX is a comprehensive benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. We evaluate 25 state-of-the-art large language models (LLMs) using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili.
arXiv Detail & Related papers (2025-03-13T15:59:20Z)
Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora [79.03392191805028]
The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners. Participants compete to optimize language model training on a fixed language data budget of 100 million words or less.
arXiv Detail & Related papers (2024-12-06T16:06:08Z)
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages [73.93600813999306]
ALM-bench is the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. It challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions.
arXiv Detail & Related papers (2024-11-25T15:44:42Z)
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z)
VLUE: A New Benchmark and Multi-task Knowledge Transfer Learning for Vietnamese Natural Language Understanding [1.813644606477824]
We introduce the first Vietnamese Language Understanding Evaluation (VLUE) benchmark. The VLUE benchmark encompasses five datasets covering different NLU tasks, including text classification, span extraction, and natural language understanding. We present CafeBERT, a new state-of-the-art pre-trained model that achieves superior results across all tasks in the VLUE benchmark.
arXiv Detail & Related papers (2024-03-23T16:26:49Z)
Vi-Mistral-X: Building a Vietnamese Language Model with Advanced Continual Pre-training [0.0]
vi-mistral-x is an innovative Large Language Model designed specifically for the Vietnamese language. It utilizes a unique method of continual pre-training, based on the Mistral architecture. It has shown to outperform existing Vietnamese LLMs in several key areas, including text classification, question answering, and text generation.
arXiv Detail & Related papers (2024-03-20T10:14:13Z)
VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension [1.3942150186842373]
This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension tasks. The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks. In contrast, the VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube.
arXiv Detail & Related papers (2024-02-05T00:54:40Z)
VinaLLaMA: LLaMA-based Vietnamese Foundation Model [4.531874270358511]
VinaLLaMA is an open-weight, state-of-the-art (SOTA) Large Language Model for the Vietnamese language, built upon LLaMA-2 with an additional 800 billion trained tokens. VinaLLaMA-7B-chat, trained on 1 million high-quality synthetic samples, achieves SOTA results on key benchmarks, including VLSP, VMLU, and Vicuna Benchmark Vietnamese.
arXiv Detail & Related papers (2023-12-18T08:27:33Z)
ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing [1.1765925931670576]
We present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT. Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks.
arXiv Detail & Related papers (2023-10-17T11:34:50Z)
MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models. MMBench is meticulously curated with well-designed quality control schemes. MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples. With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot. We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models [58.42279750824907]
We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context. C-Eval comprises multiple-choice questions across four difficulty levels: middle school, high school, college, and professional. We conduct a comprehensive evaluation of the most advanced LLMs on C-Eval, including both English- and Chinese-oriented models.
arXiv Detail & Related papers (2023-05-15T03:20:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.