ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing
- URL: http://arxiv.org/abs/2310.11166v2
- Date: Sat, 28 Oct 2023 16:22:19 GMT
- Title: ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing
- Authors: Quoc-Nam Nguyen, Thang Chau Phan, Duc-Vu Nguyen, Kiet Van Nguyen
- Abstract summary: We present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT.
Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: English and Chinese, two resource-rich languages, have witnessed strong
development of transformer-based language models for natural language processing
tasks. Although Vietnamese is spoken by approximately 100M people and several
pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA, perform well on general
Vietnamese NLP tasks such as POS tagging and named entity recognition, these
pre-trained language models remain limited on Vietnamese social media tasks. In this
paper, we present ViSoBERT, the first monolingual pre-trained language model for
Vietnamese social media texts, pre-trained on a large-scale corpus of high-quality
and diverse Vietnamese social media texts using the XLM-R architecture. We then
evaluate our pre-trained model on five important downstream tasks on Vietnamese
social media texts: emotion recognition, hate speech detection, sentiment analysis,
spam review detection, and hate speech span detection. Our experiments demonstrate
that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art
models on multiple Vietnamese social media tasks. Our ViSoBERT model is available
only for research purposes.
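Because ViSoBERT follows the XLM-R architecture, it can in principle be loaded and fine-tuned with the standard Hugging Face transformers classes. The sketch below illustrates this for one of the five downstream tasks (sentiment analysis); the checkpoint identifier "uit-nlp/visobert" and the three-class label set are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: fine-tuning-ready loading of ViSoBERT for sentiment analysis.
# The model ID below is an assumption; substitute the checkpoint released by
# the authors (research-only license per the abstract).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "uit-nlp/visobert"  # assumed Hugging Face identifier

# ViSoBERT uses the XLM-R architecture, so the generic Auto* classes apply.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=3,  # e.g. negative / neutral / positive (assumed label set)
)

# Tokenize a hypothetical Vietnamese social media comment.
batch = tokenizer(
    ["Quán này ngon xỉu, 10 điểm!"],
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**batch).logits
# Class probabilities; meaningless until the classification head is fine-tuned.
print(logits.softmax(dim=-1))
```

In practice the classification head would be fine-tuned on a labeled Vietnamese social media dataset (one per task listed above) before the predicted probabilities carry any meaning.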
Related papers
- mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus [52.83121058429025]
We introduce mOSCAR, the first large-scale multilingual and multimodal document corpus crawled from the web.
It covers 163 languages, 315M documents, 214B tokens and 1.2B images.
It shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks.
arXiv Detail & Related papers (2024-06-13T00:13:32Z)
- VLUE: A New Benchmark and Multi-task Knowledge Transfer Learning for Vietnamese Natural Language Understanding [1.813644606477824]
We introduce the first Vietnamese Language Understanding Evaluation (VLUE) benchmark.
The VLUE benchmark encompasses five datasets covering different NLU tasks, including text classification, span extraction, and natural language understanding.
We present CafeBERT, a new state-of-the-art pre-trained model that achieves superior results across all tasks in the VLUE benchmark.
arXiv Detail & Related papers (2024-03-23T16:26:49Z)
- Vi-Mistral-X: Building a Vietnamese Language Model with Advanced Continual Pre-training [0.0]
vi-mistral-x is an innovative Large Language Model designed specifically for the Vietnamese language.
It utilizes a unique method of continual pre-training, based on the Mistral architecture.
It has been shown to outperform existing Vietnamese LLMs in several key areas, including text classification, question answering, and text generation.
arXiv Detail & Related papers (2024-03-20T10:14:13Z)
- VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension [1.3942150186842373]
This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension tasks.
The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks.
In contrast, VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube.
arXiv Detail & Related papers (2024-02-05T00:54:40Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective training paradigm for training large multimodal models in non-English languages.
We build VisCPM, large multimodal models for image-to-text and text-to-image generation, which achieve state-of-the-art open-source performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- ViDeBERTa: A powerful pre-trained language model for Vietnamese [10.000783498978604]
This paper presents ViDeBERTa, a new pre-trained monolingual language model for Vietnamese.
Three versions - ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large - are pre-trained on a large-scale corpus of high-quality and diverse Vietnamese texts.
We fine-tune and evaluate our model on three important natural language downstream tasks: part-of-speech tagging, named-entity recognition, and question answering.
arXiv Detail & Related papers (2023-01-25T07:26:54Z)
- LERT: A Linguistically-motivated Pre-trained Language Model [67.65651497173998]
We propose LERT, a pre-trained language model that is trained on three types of linguistic features along with the original pre-training task.
We carried out extensive experiments on ten Chinese NLU tasks, and the experimental results show that LERT could bring significant improvements.
arXiv Detail & Related papers (2022-11-10T05:09:16Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
- PhoBERT: Pre-trained language models for Vietnamese [11.685916685552982]
We present PhoBERT, the first public large-scale monolingual language models pre-trained for Vietnamese.
Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R.
We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP.
arXiv Detail & Related papers (2020-03-02T10:21:17Z)