Enhancing Sentiment Analysis Results through Outlier Detection Optimization
- URL: http://arxiv.org/abs/2311.16185v1
- Date: Sat, 25 Nov 2023 18:20:43 GMT
- Title: Enhancing Sentiment Analysis Results through Outlier Detection Optimization
- Authors: Yuetian Chen and Mei Si
- Abstract summary: This study investigates the potential of identifying and addressing outliers in text data with subjective labels.
We utilize the Deep SVDD algorithm, a one-class classification method, to detect outliers in nine text-based emotion and sentiment analysis datasets.
- Score: 0.5439020425819
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: When dealing with text data containing subjective labels like speaker
emotions, inaccuracies or discrepancies among labelers are not uncommon. Such
discrepancies can significantly affect the performance of machine learning
algorithms. This study investigates the potential of identifying and addressing
outliers in text data with subjective labels, aiming to enhance classification
outcomes. We utilized the Deep SVDD algorithm, a one-class classification
method, to detect outliers in nine text-based emotion and sentiment analysis
datasets. By employing both a small-sized language model (DistilBERT base model
with 66 million parameters) and non-deep learning machine learning algorithms
(decision tree, KNN, Logistic Regression, and LDA) as the classifier, our
findings suggest that the removal of outliers can lead to enhanced results in
most cases. Additionally, as outliers in such datasets are not necessarily
unlearnable, we experimented with a large language model -- DeBERTa v3
large with 131 million parameters, which can capture very complex patterns in
data. We continued to observe performance enhancements across multiple
datasets.
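The filter-then-classify idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: Deep SVDD scores a sample by its squared distance to a center c in a learned feature space, whereas this sketch skips the learned network and uses the centroid of the raw feature vectors as c. The function names, the `keep_fraction` parameter, and the choice to filter the whole dataset at once (rather than per label) are all assumptions made for illustration.

```python
from math import fsum


def centroid(points):
    """Mean vector of the points; stands in for the Deep SVDD center c."""
    dim = len(points[0])
    return [fsum(p[i] for p in points) / len(points) for i in range(dim)]


def outlier_scores(points, center):
    """Squared Euclidean distance of each point to the center."""
    return [fsum((x - c) ** 2 for x, c in zip(p, center)) for p in points]


def remove_outliers(points, labels, keep_fraction=0.9):
    """Keep the keep_fraction of samples closest to the center.

    keep_fraction is a hypothetical knob, not a value from the paper.
    The original sample order is preserved among the kept samples.
    """
    center = centroid(points)
    scores = outlier_scores(points, center)
    n_keep = max(1, int(round(keep_fraction * len(points))))
    order = sorted(range(len(points)), key=scores.__getitem__)
    kept = sorted(order[:n_keep])
    return [points[i] for i in kept], [labels[i] for i in kept]


# Toy feature vectors: four clustered samples and one far-away sample.
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [5.0, 5.0]]
labels = ["joy", "joy", "sad", "sad", "joy"]

clean_features, clean_labels = remove_outliers(features, labels, keep_fraction=0.8)
```

The cleaned features and labels would then be passed to whichever classifier is being evaluated (for example, logistic regression or a fine-tuned language model), and the results compared against training on the unfiltered data.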
Related papers
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplify the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification.
XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
- Machine Learning Based Missing Values Imputation in Categorical Datasets [2.5611256859404983]
This research looked into the use of machine learning algorithms to fill in the gaps in categorical datasets.
The emphasis was on ensemble models constructed using the Error Correction Output Codes framework.
Deep learning for missing data imputation has obstacles despite these encouraging results, including the requirement for large amounts of labeled data.
arXiv Detail & Related papers (2023-06-10T03:29:48Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Empirical evaluation of shallow and deep learning classifiers for Arabic sentiment analysis [1.1172382217477126]
This work presents a detailed comparison of the performance of deep learning models for sentiment analysis of Arabic reviews.
The datasets used in this study are multi-dialect Arabic hotel and book review datasets, which are some of the largest publicly available datasets for Arabic reviews.
Results showed deep learning outperforming shallow learning for binary and multi-label classification, in contrast with the results of similar work reported in the literature.
arXiv Detail & Related papers (2021-12-01T14:45:43Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems showing that CvS is able to achieve much higher classification results compared to previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z)
- Training Dynamic based data filtering may not work for NLP datasets [0.0]
We study the applicability of the Area Under the Margin (AUM) metric to identify mislabelled examples in NLP datasets.
We find that mislabelled samples can be filtered using the AUM metric in NLP datasets but it also removes a significant number of correctly labeled points.
arXiv Detail & Related papers (2021-09-19T18:50:45Z)
- FIND: Human-in-the-Loop Debugging Deep Text Classifiers [55.135620983922564]
We propose FIND -- a framework which enables humans to debug deep learning text classifiers by disabling irrelevant hidden features.
Experiments show that by using FIND, humans can improve CNN text classifiers which were trained under different types of imperfect datasets.
arXiv Detail & Related papers (2020-10-10T12:52:53Z)
- On the Robustness of Active Learning [0.7340017786387767]
Active Learning is concerned with how to identify the most useful samples for a Machine Learning algorithm to be trained with.
We find that it is often applied with not enough care and domain knowledge.
We propose the new "Sum of Squared Logits" method based on the Simpson diversity index and investigate the effect of using the confusion matrix for balancing in sample selection.
arXiv Detail & Related papers (2020-06-18T09:07:23Z)
- Outlier Guided Optimization of Abdominal Segmentation [7.036733782879497]
We build on a pre-trained 3D U-Net model for abdominal multi-organ segmentation.
We augmented the dataset either with outlier data (e.g., exemplars for which the baseline algorithm failed) or inliers (e.g., exemplars for which the baseline algorithm worked).
We find that the marginal value of adding outliers is higher than that of adding inliers.
arXiv Detail & Related papers (2020-02-10T21:41:52Z)