SMTCE: A Social Media Text Classification Evaluation Benchmark and
BERTology Models for Vietnamese
- URL: http://arxiv.org/abs/2209.10482v1
- Date: Wed, 21 Sep 2022 16:33:46 GMT
- Title: SMTCE: A Social Media Text Classification Evaluation Benchmark and
BERTology Models for Vietnamese
- Authors: Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
- Abstract summary: We introduce the Social Media Text Classification Evaluation (SMTCE) benchmark, a collection of datasets and models across a diverse set of SMTC tasks.
We implement and analyze the effectiveness of a variety of multilingual and monolingual BERT-based models on the tasks in the benchmark.
The benchmark provides an objective assessment of multilingual and monolingual BERT-based models, which will benefit future studies on BERTology in the Vietnamese language.
- Score: 3.0938904602244355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text classification is a typical natural language processing or computational
linguistics task with various interesting applications. As the number of users
on social media platforms increases, the rapid growth of data promotes emerging
studies on Social Media Text Classification (SMTC), or social media text mining,
on these valuable resources. In contrast to English, Vietnamese, one of the
low-resource languages, has not yet been studied and exploited thoroughly.
Inspired by the success of GLUE, we introduce the Social Media Text
Classification Evaluation (SMTCE) benchmark, a collection of datasets and
models across a diverse set of SMTC tasks. With the proposed benchmark, we
implement and analyze the effectiveness of a variety of multilingual BERT-based
models (mBERT, XLM-R, and DistilmBERT) and monolingual BERT-based models
(PhoBERT, viBERT, vELECTRA, and viBERT4news) for tasks in the SMTCE benchmark.
Monolingual models outperform multilingual models and achieve state-of-the-art
results on all text classification tasks. The benchmark provides an objective
assessment of multilingual and monolingual BERT-based models, which will
benefit future studies on BERTology in the Vietnamese language.
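As a concrete illustration of the evaluation setup described above, the following is a minimal sketch of fine-tuning PhoBERT, one of the monolingual models in the benchmark, on a binary social media classification task with the Hugging Face Transformers library. The toy examples, label set, and hyperparameters are illustrative assumptions rather than the authors' configuration, and PhoBERT normally expects word-segmented Vietnamese input, which is omitted here for brevity.

```python
# Minimal sketch (toy data, assumed hyperparameters), not the authors' exact
# SMTCE setup. PhoBERT could be swapped for mBERT, XLM-R, DistilmBERT, etc.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "vinai/phobert-base"  # monolingual Vietnamese BERT (PhoBERT)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical labeled posts; a real SMTC dataset would be loaded here.
# Note: PhoBERT usually expects word-segmented input; segmentation is omitted.
texts = ["Sản phẩm này rất tốt", "Dịch vụ quá tệ"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the toy batch
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    predictions = model(**batch).logits.argmax(dim=-1)
print(predictions.tolist())
```

The same fine-tuning loop applies to the other multilingual and monolingual checkpoints named in the abstract; only the model identifier and tokenizer change.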
Related papers
- A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their efficiency and their ability to perform various language tasks with minimal computational resources.
We present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques.
arXiv Detail & Related papers (2024-10-25T23:52:28Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Comparison of Pre-trained Language Models for Turkish Address Parsing [0.0]
We focus on Turkish map data and thoroughly evaluate both multilingual and Turkish-based BERT, DistilBERT, ELECTRA and RoBERTa.
We also propose a Multi-Layer Perceptron (MLP) head for fine-tuning BERT, in addition to the standard one-layer fine-tuning approach.
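As a rough illustration of that idea, the sketch below stacks a small MLP head on a multilingual BERT encoder in place of the usual single linear layer, applied per token as in a sequence-labeling view of address parsing; the model name, label count, and layer sizes are assumptions for illustration, not the paper's configuration.

```python
# Illustrative sketch only: BERT encoder with an MLP head instead of a single
# linear layer. Sizes, dropout, and the label count are assumed values.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertMLPTagger(nn.Module):
    def __init__(self, model_name="bert-base-multilingual-cased",
                 num_labels=5, hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        dim = self.encoder.config.hidden_size
        # Standard one-layer fine-tuning would use just: nn.Linear(dim, num_labels)
        self.head = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.head(states)  # (batch, seq_len, num_labels)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertMLPTagger()
batch = tokenizer(["Atatürk Bulvarı No:10 Çankaya Ankara"], return_tensors="pt")
print(model(batch["input_ids"], batch["attention_mask"]).shape)
```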
arXiv Detail & Related papers (2023-06-24T12:09:43Z)
- Leveraging Language Identification to Enhance Code-Mixed Text
Classification [0.7340017786387767]
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Evaluating Multilingual BERT for Estonian [0.8057006406834467]
We evaluate four multilingual models -- multilingual BERT, multilingual distilled BERT, XLM and XLM-RoBERTa -- on several NLP tasks.
Our results show that multilingual BERT models can generalise well on different Estonian NLP tasks.
arXiv Detail & Related papers (2020-10-01T14:48:31Z)
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It shows its state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores in all datasets, including existing ones as well as composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
- Cross-lingual Information Retrieval with BERT [8.052497255948046]
We explore the use of the popular bidirectional language model, BERT, to model and learn the relevance between English queries and foreign-language documents.
A deep relevance matching model based on BERT is introduced and trained by fine-tuning a pretrained multilingual BERT model with weak supervision.
Experimental results of the retrieval of Lithuanian documents against short English queries show that our model is effective and outperforms the competitive baseline approaches.
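A minimal sketch of this query-document relevance idea is given below, assuming a standard multilingual BERT cross-encoder from Hugging Face Transformers; the paper's deep relevance matching architecture and weak-supervision training are not reproduced here, and the query and document are placeholder strings.

```python
# Hedged sketch: scoring an English query against a Lithuanian document with a
# multilingual BERT cross-encoder; a single regression output acts as the
# relevance score after fine-tuning. Not the paper's exact matching model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

query = "economic growth forecast"                      # short English query
document = "Lietuvos ekonomika šiemet auga sparčiai."   # placeholder Lithuanian text

inputs = tokenizer(query, document, truncation=True, max_length=256,
                   return_tensors="pt")
with torch.no_grad():
    relevance = model(**inputs).logits.squeeze().item()
print(relevance)
```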
arXiv Detail & Related papers (2020-04-24T23:32:13Z)
- Unified Multi-Criteria Chinese Word Segmentation with BERT [82.16846720508748]
Multi-Criteria Chinese Word Segmentation (MCCWS) aims at finding word boundaries in a Chinese sentence composed of continuous characters.
In this paper, we combine the strengths of the unified framework and pretrained language models, and propose a unified MCCWS model based on BERT.
Experiments on eight datasets with diverse criteria demonstrate that our methods could achieve new state-of-the-art results for MCCWS.
arXiv Detail & Related papers (2020-04-13T07:50:04Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating
Cross-lingual Generalization [128.37244072182506]
XTREME (Cross-lingual TRansfer Evaluation of Multilingual Encoders) is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)