Empirical Study of Text Augmentation on Social Media Text in Vietnamese
- URL: http://arxiv.org/abs/2009.12319v2
- Date: Fri, 9 Oct 2020 09:40:30 GMT
- Title: Empirical Study of Text Augmentation on Social Media Text in Vietnamese
- Authors: Son T. Luu, Kiet Van Nguyen and Ngan Luu-Thuy Nguyen
- Abstract summary: In the text classification problem, label imbalance in datasets affects the performance of text-classification models.
Data augmentation techniques are applied to solve the class-imbalance problem in the dataset.
Augmentation improves the F1-macro score by about 1.5% on both corpora.
- Score: 3.0938904602244355
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the text classification problem, label imbalance in datasets
affects the performance of text-classification models. In practice, the data
about user comments on social networking sites does not appear in full:
administrators often allow only positive comments and hide negative ones.
Thus, when collecting data about user comments on a social network, the data
is usually skewed toward one label, which makes the dataset imbalanced and
degrades the model's performance. Data augmentation techniques are applied to
solve this class-imbalance problem and increase the prediction model's
accuracy. In this paper, we applied augmentation techniques to the VLSP2019
Hate Speech Detection dataset of Vietnamese social texts and the UIT-VSFC:
Vietnamese Students' Feedback Corpus for Sentiment Analysis. Augmentation
improves the F1-macro score by about 1.5% on both corpora.
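The abstract does not name the specific augmentation operations used; as an illustration only, a minimal sketch of one common family of techniques for rebalancing a minority class (random word swap and random deletion over minority-class samples, in the spirit of easy data augmentation) might look like the following. The function names and parameters here are hypothetical, not the authors' implementation.

```python
import random

def random_swap(tokens, n_swaps=1, rng=None):
    """Return a copy of `tokens` with n_swaps random position swaps."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1, rng=None):
    """Drop each token with probability p, keeping at least one token."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(tokens)]

def oversample_minority(samples, labels, target_label, factor=2, rng=None):
    """Generate `factor` augmented copies of each minority-class sample."""
    rng = rng or random.Random(0)
    augmented = []
    for text, label in zip(samples, labels):
        if label != target_label:
            continue
        for _ in range(factor):
            tokens = text.split()
            tokens = random_swap(tokens, n_swaps=1, rng=rng)
            tokens = random_deletion(tokens, p=0.1, rng=rng)
            augmented.append((" ".join(tokens), label))
    return augmented
```

The augmented copies would be appended to the training set so that minority labels (e.g. HATE in a skewed comment corpus) approach the frequency of the majority label.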
Related papers
- Hate Speech Detection Using Cross-Platform Social Media Data In English and German Language [6.200058263544999]
This study focuses on detecting bilingual hate speech in YouTube comments.
We include factors such as content similarity, definition similarity, and common hate words to measure the impact of datasets on performance.
The best performance was obtained by combining datasets from YouTube comments, Twitter, and Gab, with F1-scores of 0.74 and 0.68 for English and German YouTube comments, respectively.
arXiv Detail & Related papers (2024-10-02T10:22:53Z)
- Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z)
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents by Not Safe For Work (NSFW) scores computed from images alone does not exclude all the harmful content in the alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
- Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks [3.703767478524629]
"Noisy" self-training approaches incorporate data augmentation techniques to ensure prediction consistency and increase robustness against adversarial attacks.
We evaluate our experiments on two offensive/hate-speech datasets and demonstrate that (i) self-training consistently improves performance regardless of model size, resulting in up to +1.5% F1-macro on both datasets, and (ii) noisy self-training with textual data augmentations, despite being successfully applied in similar settings, decreases performance on offensive and hate-speech domains when compared to the default method, even with state-of-the-art augmentations such as backtranslation.
arXiv Detail & Related papers (2023-07-31T12:35:54Z)
- Harnessing the Power of Text-image Contrastive Models for Automatic Detection of Online Misinformation [50.46219766161111]
We develop a self-learning model to explore contrastive learning in the domain of misinformation identification.
Our model shows superior performance in detecting non-matched image-text pairs when the training data is insufficient.
arXiv Detail & Related papers (2023-04-19T02:53:59Z)
- Data Augmentation for Mental Health Classification on Social Media [0.0]
Mental disorders of online users are detected from their social media posts.
The major challenge in this domain is obtaining ethical clearance for using user-generated text from social media platforms.
We have studied the effect of data augmentation techniques on domain specific user generated text for mental health classification.
arXiv Detail & Related papers (2021-12-19T05:09:01Z)
- Data Expansion using Back Translation and Paraphrasing for Hate Speech Detection [1.192436948211501]
We present a new deep learning-based method that fuses a back-translation method and a paraphrasing technique for data augmentation.
We evaluate our proposal on five publicly available datasets, namely the AskFm corpus, the Formspring dataset, the Warner and Waseem dataset, OLID, and the Wikipedia toxic comments dataset.
arXiv Detail & Related papers (2021-05-25T09:52:42Z)
- A Large-scale Dataset for Hate Speech Detection on Vietnamese Social Media Texts [0.32228025627337864]
ViHSD is a human-annotated dataset for automatically detecting hate speech on social networks.
The dataset contains over 30,000 comments; each comment has one of three labels: CLEAN, OFFENSIVE, or HATE.
arXiv Detail & Related papers (2021-03-22T00:55:47Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
- Deep F-measure Maximization for End-to-End Speech Understanding [52.36496114728355]
We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation.
We perform experiments on two standard fairness datasets, Adult and Communities and Crime, as well as on speech-to-intent detection on the ATIS dataset and speech-to-image concept classification on the Speech-COCO dataset.
On all four tasks, the F-measure objective improves micro-F1 scores, with absolute improvements of up to 8% compared to models trained with the cross-entropy loss function.
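The entry above trains with a differentiable approximation to the F-measure; a minimal sketch of the common "soft-F1" idea, replacing hard true/false-positive counts with expected counts over predicted probabilities, is shown below. This illustrates the general approach only, not the authors' exact approximation.

```python
import numpy as np

def soft_f1_loss(probs, targets, eps=1e-8):
    """1 - soft-F1: hard counts are replaced by expected counts,
    so the objective is differentiable in the predicted probabilities."""
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    tp = np.sum(probs * targets)          # expected true positives
    fp = np.sum(probs * (1.0 - targets))  # expected false positives
    fn = np.sum((1.0 - probs) * targets)  # expected false negatives
    f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1
```

Because every term is a smooth function of `probs`, the loss can be minimized with standard backpropagation, unlike the discrete F1 score itself.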
arXiv Detail & Related papers (2020-08-08T03:02:27Z)
- Semi-Supervised Models via Data Augmentation for Classifying Interactive Affective Responses [85.04362095899656]
We present semi-supervised models with data augmentation (SMDA), a semi-supervised text classification system to classify interactive affective responses.
For labeled sentences, we performed data augmentation to make the label distributions uniform and computed a supervised loss during training.
For unlabeled sentences, we explored self-training by regarding low-entropy predictions over unlabeled sentences as pseudo labels.
arXiv Detail & Related papers (2020-04-23T05:02:31Z)
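The self-training step in the SMDA entry above, treating low-entropy predictions over unlabeled sentences as pseudo labels, can be sketched as follows; the entropy threshold and function names are illustrative assumptions, not the paper's configuration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def select_pseudo_labels(unlabeled_probs, max_entropy=0.5):
    """Keep only confident (low-entropy) predictions as pseudo labels.
    Returns (index, argmax_class) pairs for the retained examples."""
    pseudo = []
    for i, probs in enumerate(unlabeled_probs):
        if entropy(probs) < max_entropy:
            pseudo.append((i, max(range(len(probs)), key=lambda c: probs[c])))
    return pseudo
```

Retained pairs would then be mixed into the labeled set for the next training round, while high-entropy (uncertain) predictions are discarded.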
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.