A Simple and Efficient Ensemble Classifier Combining Multiple Neural
Network Models on Social Media Datasets in Vietnamese
- URL: http://arxiv.org/abs/2009.13060v2
- Date: Tue, 29 Sep 2020 01:32:26 GMT
- Authors: Huy Duc Huynh, Hang Thi-Thuy Do, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
- Abstract summary: This study aims to classify Vietnamese texts on social media from three different Vietnamese benchmark datasets.
Advanced deep learning models are used and optimized in this study, including CNN, LSTM, and their variants.
Our ensemble model achieves the best performance on all three datasets.
- Score: 2.7528170226206443
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text classification is a popular topic of natural language processing, which
has currently attracted numerous research efforts worldwide. The significant
increase of data in social media requires the vast attention of researchers to
analyze such data. There are various studies in this field in many languages
but limited to the Vietnamese language. Therefore, this study aims to classify
Vietnamese texts on social media from three different Vietnamese benchmark
datasets. Advanced deep learning models are used and optimized in this study,
including CNN, LSTM, and their variants. We also implement the BERT, which has
never been applied to the datasets. Our experiments find a suitable model for
classification tasks on each specific dataset. To take advantage of single
models, we propose an ensemble model, combining the highest-performance models.
Our single models reach positive results on each dataset. Moreover, our
ensemble model achieves the best performance on all three datasets: an
F1-score of 86.96% on the HSD-VLSP dataset, 65.79% on the UIT-VSMEC dataset,
and 92.79% and 89.70% for sentiment and topic classification on the UIT-VSFC
dataset, respectively. Our models therefore outperform previous studies on
these datasets.
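The abstract's ensemble combines the highest-performing single models (CNN, LSTM, and their variants, plus BERT). The paper does not spell out the combination rule here, so the following is a minimal sketch under the assumption of soft voting: average each model's predicted class probabilities and pick the argmax. The model names and class probabilities are illustrative, not taken from the paper.

```python
def soft_vote(prob_lists):
    """Average per-model probability vectors and return the winning class index.

    prob_lists: one probability vector per model, all over the same label set.
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    # Average the probability assigned to each class across all models.
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    # The ensemble prediction is the class with the highest averaged probability.
    return max(range(n_classes), key=lambda c: avg[c])

# Hypothetical example: three single models scoring one social-media comment
# over three classes.
cnn = [0.6, 0.3, 0.1]
lstm = [0.2, 0.5, 0.3]
bert = [0.3, 0.4, 0.3]
print(soft_vote([cnn, lstm, bert]))  # class 1: averaged probs ~ [0.37, 0.40, 0.23]
```

Even when individual models disagree (here the CNN alone prefers class 0), averaging lets the majority of probability mass decide, which is why ensembles of diverse strong models often beat each member alone.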
Related papers
- Transformer-Based Contextualized Language Models Joint with Neural Networks for Natural Language Inference in Vietnamese [1.7457686843484872]
We conduct experiments using various combinations of contextualized language models (CLM) and neural networks.
We find that the joint approach of CLM and neural networks is simple yet capable of achieving high-quality performance.
arXiv Detail & Related papers (2024-11-20T15:46:48Z)
- FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models [48.484485609995986]
Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM)
There are currently no realistic datasets and benchmarks for FedLLM.
We propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics.
arXiv Detail & Related papers (2024-06-07T11:19:30Z)
- MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis [1.5761916307614148]
We propose the first benchmark of sentence embeddings for French.
We compare 51 carefully selected embedding models on a large scale.
We find that even if no model is the best on all tasks, large multilingual models pre-trained on sentence similarity perform exceptionally well.
arXiv Detail & Related papers (2024-05-30T20:34:37Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Improving Persian Relation Extraction Models by Data Augmentation [0.0]
We present our augmented dataset and the results and findings of our system.
We use PERLEX as the base dataset and enhance it by applying some text preprocessing steps.
We then employ two different models including ParsBERT and multilingual BERT for relation extraction on the augmented PERLEX dataset.
arXiv Detail & Related papers (2022-03-29T08:08:47Z)
- MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z)
- Benchmarking the Benchmark -- Analysis of Synthetic NIDS Datasets [4.125187280299247]
We analyse the statistical properties of benign traffic in three of the more recent and relevant NIDS datasets.
Our results show a distinct difference of most of the considered statistical features between the synthetic datasets and two real-world datasets.
arXiv Detail & Related papers (2021-04-19T03:17:37Z)
- Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics [118.75207687144817]
We introduce Data Maps, a model-based tool to characterize and diagnose datasets.
We leverage a largely ignored source of information: the behavior of the model on individual instances during training.
Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
arXiv Detail & Related papers (2020-09-22T20:19:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.