Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density
- URL: http://arxiv.org/abs/2111.01689v2
- Date: Wed, 3 Nov 2021 01:46:27 GMT
- Title: Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density
- Authors: Juuso Eronen, Michal Ptaszynski, Fumito Masui, Aleksander
Smywiński-Pohl, Gniewosz Leliwa, Michal Wroczynski
- Abstract summary: We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
- Score: 58.64907136562178
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We study the effectiveness of Feature Density (FD) using different
linguistically-backed feature preprocessing methods in order to estimate
dataset complexity, which in turn is used to comparatively estimate the
potential performance of machine learning (ML) classifiers prior to any
training. We hypothesise that estimating dataset complexity allows us to reduce
the number of required experiment iterations. This way we can optimize the
resource-intensive training of ML models, which is becoming a serious issue due
to the increase in available dataset sizes and the ever-rising popularity of
models based on Deep Neural Networks (DNN). The constantly growing need for
more powerful computational resources is also affecting the environment due to
the alarmingly growing amount of CO2 emissions caused by the training of
large-scale ML models. The research was conducted on multiple datasets,
including popular ones such as the Yelp business review dataset used for
training typical sentiment analysis models, as well as more recent datasets
trying to tackle the problem of cyberbullying, which, besides being a serious
social problem, is also a much more sophisticated task from the point of view
of linguistic representation. We use cyberbullying datasets
collected for multiple languages, namely English, Japanese and Polish. The
difference in linguistic complexity of datasets allows us to additionally
discuss the efficacy of linguistically-backed word preprocessing.
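As a rough illustration of the core idea, the sketch below computes Feature Density as the ratio of unique features to all feature occurrences in a dataset and compares two hypothetical preprocessing pipelines (surface tokens vs. lemmas). This is a minimal sketch under that assumed definition; the exact FD formulation and the linguistically-backed feature extractors used in the paper may differ.

```python
from collections import Counter
from typing import Iterable, List

def feature_density(documents: Iterable[List[str]]) -> float:
    """Ratio of unique features to total feature occurrences in a dataset.

    Each document is a list of features produced by some preprocessing step
    (surface tokens, lemmas, POS-tagged tokens, dependency chunks, ...).
    """
    counts = Counter()
    total = 0
    for doc in documents:
        counts.update(doc)
        total += len(doc)
    return len(counts) / total if total else 0.0

# Hypothetical toy corpus preprocessed two ways.
surface_tokens = [["dogs", "are", "running"], ["a", "dog", "runs"]]
lemmatized = [["dog", "be", "run"], ["a", "dog", "run"]]

print(feature_density(surface_tokens))  # 6 unique / 6 total = 1.00
print(feature_density(lemmatized))      # 4 unique / 6 total ~= 0.67
```

Under this reading, a lower FD after lemmatisation reflects a more compact feature space, and such per-dataset, per-preprocessing estimates could be compared before committing to full training runs, in line with the paper's hypothesis.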
Related papers
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- Few-shot learning for automated content analysis: Efficient coding of arguments and claims in the debate on arms deliveries to Ukraine [0.9576975587953563]
Pre-trained language models (PLM) based on transformer neural networks offer great opportunities to improve automatic content analysis in communication science.
Three characteristics have so far impeded the widespread adoption of these methods in the applying disciplines: the dominance of English language models in NLP research, the necessary computing resources, and the effort required to produce training data to fine-tune PLMs.
We test our approach on a realistic use case from communication science to automatically detect claims and arguments together with their stance in the German news debate on arms deliveries to Ukraine.
arXiv Detail & Related papers (2023-12-28T11:39:08Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We also examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- Exploring the Potential of Feature Density in Estimating Machine Learning Classifier Performance with Application to Cyberbullying Detection [2.4674086273775035]
We analyze the potential of Feature Density (FD) as a way to comparatively estimate machine learning (ML) classifier performance prior to training.
Our approach is to optimize the resource-intensive training of ML models for Natural Language Processing in order to reduce the number of required experiments.
arXiv Detail & Related papers (2022-06-04T09:11:13Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Improving Commonsense Causal Reasoning by Adversarial Training and Data Augmentation [14.92157586545743]
This paper presents a number of techniques for making models more robust in the domain of causal reasoning.
We show a statistically significant improvement in performance on both datasets, even with only a small number of additionally generated data points.
arXiv Detail & Related papers (2021-01-13T09:55:29Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)