VinaLLaMA: LLaMA-based Vietnamese Foundation Model
- URL: http://arxiv.org/abs/2312.11011v1
- Date: Mon, 18 Dec 2023 08:27:33 GMT
- Title: VinaLLaMA: LLaMA-based Vietnamese Foundation Model
- Authors: Quan Nguyen, Huy Pham and Dung Dao
- Abstract summary: VinaLLaMA is an open-weight, state-of-the-art (SOTA) Large Language Model for the Vietnamese language, built upon LLaMA-2 with an additional 800 billion trained tokens.
VinaLLaMA-7B-chat, trained on 1 million high-quality synthetic samples, achieves SOTA results on key benchmarks, including VLSP, VMLU, and Vicuna Benchmark Vietnamese.
- Score: 4.531874270358511
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this technical report, we present VinaLLaMA, an open-weight,
state-of-the-art (SOTA) Large Language Model for the Vietnamese language, built
upon LLaMA-2 with an additional 800 billion trained tokens. VinaLLaMA not only
demonstrates fluency in Vietnamese but also exhibits a profound understanding
of Vietnamese culture, making it a truly indigenous model. VinaLLaMA-7B-chat,
trained on 1 million high-quality synthetic samples, achieves SOTA results on
key benchmarks, including VLSP, VMLU, and Vicuna Benchmark Vietnamese, marking
a significant advancement in the Vietnamese AI landscape and offering a
versatile resource for various applications.
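As a quick illustration of how an open-weight chat model like this is typically used, the sketch below loads a checkpoint with Hugging Face transformers and generates a Vietnamese reply. The hub id "vilm/vinallama-7b-chat", the prompt, and the sampling settings are assumptions for illustration, not details taken from the report.

```python
# Hedged sketch: querying an open-weight Vietnamese chat model with
# Hugging Face transformers. "vilm/vinallama-7b-chat" is an assumed hub id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "vilm/vinallama-7b-chat"  # assumption; substitute the real repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision so a 7B model fits one GPU
    device_map="auto",
)

# "Hello! Can you introduce Vietnam?" -- a simple Vietnamese prompt.
prompt = "Xin chào! Bạn có thể giới thiệu về Việt Nam không?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs, max_new_tokens=256, do_sample=True, temperature=0.7
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```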
Related papers
- Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese [0.0]
Vintern-1B is a reliable multimodal large language model (MLLM) for Vietnamese language tasks.
The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs.
Vintern-1B is small enough to fit easily into various on-device applications.
arXiv Detail & Related papers (2024-08-22T15:15:51Z)
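To make Vintern-1B's training-data claim concrete, here is one plausible shape for a single image-question-answer record written out as JSONL; the field names and file layout are assumptions for illustration, since the paper's actual schema is not shown here.

```python
# Hedged sketch: one plausible image-question-answer record of the kind
# Vintern-1B is fine-tuned on. Field names are assumptions, not the
# paper's schema.
import json

record = {
    "image": "images/example_0001.jpg",        # path to the image file
    "question": "Món ăn trong ảnh là gì?",     # "What dish is in the picture?"
    "answer": "Đây là một bát phở bò.",        # "It is a bowl of beef pho."
}

# Corpora at the multi-million scale are commonly stored one JSON object
# per line (JSONL), which streams well during fine-tuning.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```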
- MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities [146.4724093405187]
We introduce MM-Vet v2, which adds a new capability called "image-text sequence understanding".
Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0.
arXiv Detail & Related papers (2024-08-01T17:59:54Z)
- SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge the gap in regional language coverage by supporting a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z) - SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages [64.10040374077994]
We introduce SEACrowd, a collaborative initiative that consolidates standardized corpora in nearly 1,000 languages across three modalities.
We assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA.
arXiv Detail & Related papers (2024-06-14T15:23:39Z) - ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models [0.0]
ViLLM-Eval is a comprehensive evaluation suite designed to measure the advanced knowledge and reasoning abilities of foundation models.
A thorough evaluation of the most advanced LLMs on ViLLM-Eval revealed that even the best performing models have significant room for improvement.
arXiv Detail & Related papers (2024-04-17T05:57:17Z) - LaVy: Vietnamese Multimodal Large Language Model [0.0]
Large Language Models (LLMs) and Multimodal Large language models (MLLMs) have taken the world by storm with impressive abilities in complex reasoning and linguistic comprehension.
While there is a plethora of work on Vietnamese Large Language Models, the lack of high-quality multimodal resources limits the progress of Vietnamese MLLMs.
In this paper, we pioneer in addressing this gap by introducing LaVy, a state-of-the-art Vietnamese MLLM, and we also introduce LaVy-Bench, a benchmark designed for evaluating MLLMs' understanding of Vietnamese visual language tasks.
arXiv Detail & Related papers (2024-04-11T17:09:28Z) - Vi-Mistral-X: Building a Vietnamese Language Model with Advanced Continual Pre-training [0.0]
vi-mistral-x is an innovative Large Language Model designed specifically for the Vietnamese language.
It utilizes a unique method of continual pre-training, based on the Mistral architecture.
It has been shown to outperform existing Vietnamese LLMs in several key areas, including text classification, question answering, and text generation.
arXiv Detail & Related papers (2024-03-20T10:14:13Z)
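Continual pre-training, as described in the vi-mistral-x entry above, amounts to resuming the causal-language-modeling objective of a pretrained checkpoint on new-language text. The sketch below shows that loop with Hugging Face transformers; the base checkpoint, corpus file, and hyperparameters are illustrative assumptions rather than the paper's recipe.

```python
# Hedged sketch of continual pre-training: keep training a pretrained
# Mistral-style checkpoint with the plain causal-LM objective on a
# Vietnamese corpus. All hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "mistralai/Mistral-7B-v0.1"         # public base checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default
model = AutoModelForCausalLM.from_pretrained(BASE)

# "vi_corpus.txt" is a placeholder for any large Vietnamese text corpus.
raw = load_dataset("text", data_files={"train": "vi_corpus.txt"})["train"]
train_ds = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="vi-continual",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,  # lower than from-scratch LRs to limit forgetting
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```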
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- SeaLLMs -- Large Language Models for Southeast Asia [76.50157503379086]
We introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages.
SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning.
Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities.
arXiv Detail & Related papers (2023-12-01T17:17:56Z)
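The "extended vocabulary" step in the SeaLLMs entry above can be sketched in a few lines: language-specific tokens are added to the base Llama-2 tokenizer and the embedding matrix is resized so the new ids get trainable rows. The base checkpoint and token list below are illustrative assumptions, not the SeaLLMs vocabulary.

```python
# Hedged sketch of vocabulary extension before continued pre-training:
# add language-specific tokens to a Llama-2 tokenizer and grow the
# embedding matrix to match. The token list is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; needs HF access approval
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# In practice these tokens would come from training a SentencePiece model
# on target-language text and keeping pieces missing from the base vocab.
new_tokens = ["chào", "Việt", "không"]  # placeholder subwords
num_added = tokenizer.add_tokens(new_tokens)

# Resize input (and tied output) embeddings so the new ids have rows to
# train; the new rows start randomly initialized and are learned during
# continued pre-training.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```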
- ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing [1.1765925931670576]
We present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT.
Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks.
arXiv Detail & Related papers (2023-10-17T11:34:50Z)
- Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks [63.09588102724274]
We release the largest public Chinese high-quality video-language dataset named Youku-mPLUG.
Youku-mPLUG contains 10 million Chinese video-text pairs filtered from 400 million raw videos across a wide range of 45 diverse categories for large-scale pre-training.
We build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification.
arXiv Detail & Related papers (2023-06-07T11:52:36Z)