Is word segmentation necessary for Vietnamese sentiment classification?
- URL: http://arxiv.org/abs/2301.00418v1
- Date: Sun, 1 Jan 2023 15:04:47 GMT
- Title: Is word segmentation necessary for Vietnamese sentiment classification?
- Authors: Duc-Vu Nguyen, Ngan Luu-Thuy Nguyen
- Abstract summary: This paper made the first attempt to answer whether word segmentation is necessary for Vietnamese sentiment classification.
We presented five pre-trained monolingual S4- based language models for Vietnamese, including one model without word segmentation.
In this way, the RDRsegmenter is the stable toolkit for word segmentation among the uitnlp, pyvi, and underthesea toolkits.
- Score: 0.30458514384586405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To the best of our knowledge, this paper made the first attempt to answer
whether word segmentation is necessary for Vietnamese sentiment classification.
To do this, we presented five pre-trained monolingual S4- based language models
for Vietnamese, including one model without word segmentation, and four models
using RDRsegmenter, uitnlp, pyvi, or underthesea toolkits in the pre-processing
data phase. According to comprehensive experimental results on two corpora,
including the VLSP2016-SA corpus of technical article reviews from the news and
social media and the UIT-VSFC corpus of the educational survey, we have two
suggestions. Firstly, using traditional classifiers like Naive Bayes or Support
Vector Machines, word segmentation maybe not be necessary for the Vietnamese
sentiment classification corpus, which comes from the social domain. Secondly,
word segmentation is necessary for Vietnamese sentiment classification when
word segmentation is used before using the BPE method and feeding into the deep
learning model. In this way, the RDRsegmenter is the stable toolkit for word
segmentation among the uitnlp, pyvi, and underthesea toolkits.
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - Vocabulary-free Image Classification and Semantic Segmentation [71.78089106671581]
We introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an un-constrained language-induced semantic space to an input image without needing a known vocabulary.
VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories.
We propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database.
arXiv Detail & Related papers (2024-04-16T19:27:21Z) - SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural
Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT)
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z) - Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z) - Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic
Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages.
Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points.
By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z) - Subword Segmental Machine Translation: Unifying Segmentation and Target
Sentence Generation [7.252933737829635]
Subword segmental machine translation (SSMT) learns to segment target sentence words while jointly learning to generate target sentences.
Experiments across 6 translation directions show that SSMT improves chrF scores for morphologically rich agglutinative languages.
arXiv Detail & Related papers (2023-05-11T17:44:29Z) - Analyzing Vietnamese Legal Questions Using Deep Neural Networks with
Biaffine Classifiers [3.116035935327534]
We propose using deep neural networks to extract important information from Vietnamese legal questions.
Given a legal question in natural language, the goal is to extract all the segments that contain the needed information to answer the question.
arXiv Detail & Related papers (2023-04-27T18:19:24Z) - Betrayed by Captions: Joint Caption Grounding and Generation for Open
Vocabulary Instance Segmentation [80.48979302400868]
We focus on open vocabulary instance segmentation to expand a segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and captions in nouns.
We devise a joint textbfCaption Grounding and Generation (CGG) framework, which incorporates a novel grounding loss that only focuses on matching object to improve learning efficiency.
arXiv Detail & Related papers (2023-01-02T18:52:12Z) - Augmenting Part-of-speech Tagging with Syntactic Information for
Vietnamese and Chinese [0.32228025627337864]
We implement the idea to improve word segmentation and part of speech tagging of the Vietnamese language by employing a simplified constituency.
Our neural model for joint word segmentation and part-of-speech tagging has the architecture of the syllable-based constituency.
This model can be augmented with predicted word boundary and part-of-speech tags by other tools.
arXiv Detail & Related papers (2021-02-24T08:57:02Z) - Vietnamese Word Segmentation with SVM: Ambiguity Reduction and Suffix
Capture [2.7528170226206443]
We propose two novel ways to feature extraction, one to reduce the overlap ambiguity and the other to increase the ability to predict unknown words containing suffixes.
Our proposed method obtained a better F1-score than the prior state-of-the-art methods UETsegmenter, and RDRsegmenter.
arXiv Detail & Related papers (2020-06-14T05:19:46Z) - Incorporating Uncertain Segmentation Information into Chinese NER for
Social Media Text [18.455836845989523]
segmentation error propagation is a challenge for Chinese named entity recognition systems.
We propose a model (UIcwsNN) that specializes in identifying entities from Chinese social media text.
arXiv Detail & Related papers (2020-04-14T09:39:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.