A meta-analysis on the performance of machine-learning based language models for sentiment analysis
- URL: http://arxiv.org/abs/2509.09728v1
- Date: Wed, 10 Sep 2025 10:05:32 GMT
- Title: A meta-analysis on the performance of machine-learning based language models for sentiment analysis
- Authors: Elena Rohde, Jonas Klingwort, Christian Borgs
- Abstract summary: The study aims to estimate the average performance, assess heterogeneity between and within studies, and analyze how study characteristics influence model performance. Overall accuracy is widely used but often misleading due to its sensitivity to class imbalance and the number of sentiment classes, highlighting the need for normalization.
- Score: 0.5243460995467893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a meta-analysis evaluating ML performance in sentiment analysis for Twitter data. The study aims to estimate the average performance, assess heterogeneity between and within studies, and analyze how study characteristics influence model performance. Using PRISMA guidelines, we searched academic databases and selected 195 trials from 20 studies with 12 study features. Overall accuracy, the most reported performance metric, was analyzed using double arcsine transformation and a three-level random effects model. The average overall accuracy of the AIC-optimized model was 0.80 [0.76, 0.84]. This paper provides two key insights: 1) Overall accuracy is widely used but often misleading due to its sensitivity to class imbalance and the number of sentiment classes, highlighting the need for normalization. 2) Standardized reporting of model performance, including reporting confusion matrices for independent test sets, is essential for reliable comparisons of ML classifiers across studies, which seems far from common practice.
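As a concrete illustration of the pooling pipeline the abstract describes, here is a minimal Python sketch of the Freeman-Tukey double arcsine transform and a random-effects pool. The paper fits a three-level random-effects model (trials nested within studies), typically done in specialized software such as R's metafor; the two-level DerSimonian-Laird estimator and the trial counts below are simplified stand-ins.

```python
import numpy as np

def double_arcsine(k, n):
    """Freeman-Tukey double arcsine transform of a proportion k/n
    (e.g. overall accuracy: k correct cases out of n), which
    stabilizes the variance so proportions can be pooled."""
    return np.arcsin(np.sqrt(k / (n + 1))) + np.arcsin(np.sqrt((k + 1) / (n + 1)))

def random_effects_pool(y, v):
    """DerSimonian-Laird random-effects pool of effects y with
    sampling variances v; returns the pooled estimate."""
    w = 1.0 / v
    y_fe = np.sum(w * y) / np.sum(w)            # fixed-effect mean
    q = np.sum(w * (y - y_fe) ** 2)             # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)     # between-study variance
    w_re = 1.0 / (v + tau2)
    return np.sum(w_re * y) / np.sum(w_re)

# Invented trials, each reported as (correct, total).
trials = [(410, 500), (880, 1000), (150, 200)]
y = np.array([double_arcsine(k, n) for k, n in trials])
v = np.array([1.0 / (n + 0.5) for _, n in trials])   # FT sampling variance
pooled = random_effects_pool(y, v)
# Naive back-transform (Miller's correction is often preferred).
print(f"pooled accuracy ~ {np.sin(pooled / 2) ** 2:.3f}")
```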
Related papers
- Performance of models for monitoring sustainable development goals from remote sensing: A three-level meta-regression [0.0]
Machine learning (ML) is a tool to exploit remote sensing data for the monitoring and implementation of the United Nations' Sustainable Development Goals (SDGs). In this paper, we report on a meta-analysis to evaluate the performance of ML applied to remote sensing data to monitor SDGs. Overall accuracy was the most reported performance metric; it was analyzed using double arcsine transformation and a three-level random effects model.
arXiv Detail & Related papers (2026-01-07T15:16:26Z)
- The Impact of Feature Scaling In Machine Learning: Effects on Regression and Classification Tasks [0.6263680699548958]
This research addresses the critical lack of comprehensive studies on feature scaling by systematically evaluating 12 scaling techniques across 14 different Machine Learning algorithms and 16 datasets for classification and regression tasks. We meticulously analyzed impacts on predictive performance (using metrics such as accuracy, MAE, MSE, and $R^2$) and computational costs (training time, inference time, and memory usage).
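A minimal sketch of the kind of head-to-head comparison the study runs at scale, using a handful of scikit-learn scalers on a toy dataset; the dataset, classifier, and scaler subset here are illustrative choices, not the study's.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Cross-validated accuracy of one classifier under a few scalers.
X, y = load_wine(return_X_y=True)
for scaler in (None, StandardScaler(), MinMaxScaler(), RobustScaler()):
    steps = ([scaler] if scaler is not None else []) + [LogisticRegression(max_iter=5000)]
    acc = cross_val_score(make_pipeline(*steps), X, y, cv=5).mean()
    name = scaler.__class__.__name__ if scaler is not None else "no scaling"
    print(f"{name:>14}: {acc:.3f}")
```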
arXiv Detail & Related papers (2025-06-09T22:32:51Z)
- Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning [77.120955854093]
We show that data diversity can be a strong predictor of generalization in language models. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. We present Prismatic Synthesis, a framework for generating diverse synthetic data.
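The abstract gives enough to sketch the flavor of a Vendi-style diversity score: the exponential of the Shannon entropy of the eigenvalues of a normalized similarity kernel. How G-Vendi extracts the model-induced gradients is not specified here, so the sketch below takes a plain matrix of "gradient features" as given; details of the real metric may differ.

```python
import numpy as np

def vendi_style_score(G):
    """Diversity as the exponential of the Shannon entropy of the
    eigenvalues of a normalized similarity kernel over the rows of G
    (a hypothetical (n_examples, dim) matrix of gradient features)."""
    X = G / np.linalg.norm(G, axis=1, keepdims=True)
    lam = np.linalg.eigvalsh(X @ X.T / len(X))   # eigenvalues sum to 1
    lam = lam[lam > 1e-12]
    return float(np.exp(-np.sum(lam * np.log(lam))))

rng = np.random.default_rng(0)
spread = rng.normal(size=(100, 32))                    # varied "gradients"
copies = np.tile(rng.normal(size=(1, 32)), (100, 1))   # redundant data
print(vendi_style_score(spread), vendi_style_score(copies))  # high vs ~1.0
```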
arXiv Detail & Related papers (2025-05-26T16:05:10Z)
- Area under the ROC Curve has the Most Consistent Evaluation for Binary Classification [3.1850615666574806]
This study investigates how consistent different metrics are at evaluating models across data of different prevalence.
I find that evaluation metrics that are less influenced by prevalence offer more consistent evaluation of individual models and more consistent ranking of a set of models.
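A quick sketch of the phenomenon being measured, using scikit-learn on synthetic data (the setup is illustrative, not the paper's): accuracy moves with class prevalence while AUC stays comparatively stable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Same learning problem at two prevalence levels of the positive class.
for prevalence in (0.5, 0.9):
    X, y = make_classification(n_samples=4000, weights=[1 - prevalence],
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"prevalence={prevalence:.1f}  accuracy={acc:.3f}  AUC={auc:.3f}")
```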
arXiv Detail & Related papers (2024-08-19T17:52:38Z)
- Unraveling overoptimism and publication bias in ML-driven science [14.38643099447636]
Recent studies suggest that the published performance of machine learning models is often overoptimistic.
We introduce a novel model for observed accuracy, integrating parametric learning curves with these biases.
Applying the model to meta-analyses of classifications of neurological conditions, we estimate the inherent limits of ML-based prediction in each domain.
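A sketch of the learning-curve component of such a model, assuming an inverse power-law form and invented (sample size, accuracy) pairs; the paper's full model additionally accounts for overoptimism and publication bias, which this sketch omits. The fitted asymptote plays the role of the inherent limit of prediction.

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, b, c):
    """Inverse power-law learning curve: accuracy approaches the
    asymptote a (the 'inherent limit') as sample size n grows."""
    return a - b * np.power(n, -c)

# Invented (training-set size, reported accuracy) pairs.
n = np.array([50, 100, 200, 400, 800, 1600], dtype=float)
acc = np.array([0.62, 0.68, 0.72, 0.75, 0.77, 0.78])
(a, b, c), _ = curve_fit(learning_curve, n, acc, p0=(0.8, 1.0, 0.5),
                         bounds=([0.0, 0.0, 0.0], [1.0, 10.0, 2.0]))
print(f"estimated asymptotic accuracy ~ {a:.3f}")
```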
arXiv Detail & Related papers (2024-05-23T10:43:20Z)
- A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which exposes their shortcomings.
The commonly used benchmark, SIGHAN, cannot reliably evaluate model performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. We further elaborate a robustness metric under which a model is judged robust only if its performance is consistently accurate across the cliques.
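A plain-language reading of that metric as a sketch (the paper's exact formula may differ): a model scores on a clique only if it is correct on every example in it.

```python
from statistics import mean

def clique_robust_accuracy(cliques, predict):
    """Fraction of cliques on which the model is correct for *every*
    paraphrased example; each clique is a list of (input, gold) pairs
    that carry the same knowledge."""
    return mean(all(predict(x) == gold for x, gold in clique)
                for clique in cliques)

# Toy usage with a hypothetical extractor.
cliques = [[
    ("Marie Curie won the Nobel Prize.",
     ("Marie Curie", "won", "the Nobel Prize")),
    ("The Nobel Prize was won by Marie Curie.",
     ("Marie Curie", "won", "the Nobel Prize")),
]]
print(clique_robust_accuracy(cliques, lambda x: ("Marie Curie", "won", "the Nobel Prize")))
```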
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- Variable Importance Matching for Causal Inference [73.25504313552516]
We describe a general framework called Model-to-Match.
Model-to-Match uses variable importance measurements to construct a distance metric.
We operationalize the Model-to-Match framework with LASSO.
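A minimal sketch of that recipe on synthetic data, assuming absolute LASSO coefficients serve as the importance weights and simple nearest-neighbor matching; the paper's actual weighting and matching rules may differ.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)                 # synthetic observational data
X = rng.normal(size=(500, 6))
treated = rng.integers(0, 2, size=500).astype(bool)
y = 2.0 * X[:, 0] + X[:, 1] - 0.5 * treated + rng.normal(size=500)

# Variable importance from LASSO on control outcomes -> distance weights.
Xs = StandardScaler().fit_transform(X)
weights = np.abs(LassoCV(cv=5).fit(Xs[~treated], y[~treated]).coef_)

# Match each treated unit to the nearest control under the weighted metric.
controls = Xs[~treated]
dists = np.sqrt(((Xs[treated, None, :] - controls[None, :, :]) ** 2 * weights).sum(-1))
match = dists.argmin(axis=1)
att = np.mean(y[treated] - y[~treated][match])
print(f"matched effect estimate ~ {att:.2f} (true effect: -0.50)")
```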
arXiv Detail & Related papers (2023-02-23T00:43:03Z)
- Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data [66.11139091362078]
We provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics.
Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks.
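These metrics need only the trained weights. A rough sketch of the family, assuming a simple Hill estimator for the power-law exponent of a layer's eigenspectrum; the paper's estimators (cf. the weightwatcher package) are considerably more careful.

```python
import numpy as np

def hill_alpha(W, k=20):
    """Rough power-law exponent of the eigenspectrum of W^T W via a
    Hill estimator on the top-k eigenvalues; needs only the weights,
    no training or test data."""
    eigs = np.linalg.svd(W, compute_uv=False) ** 2
    tail = np.sort(eigs)[-k:]
    return 1.0 + k / np.sum(np.log(tail / tail.min()))

# A random (untrained) layer; trained layers typically show heavier tails.
W = np.random.default_rng(0).normal(size=(512, 512)) / np.sqrt(512)
print(f"alpha ~ {hill_alpha(W):.2f}")
```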
arXiv Detail & Related papers (2022-02-06T20:07:35Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
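A minimal sketch of the oversampling half of that strategy, written from scratch for clarity; libraries such as imbalanced-learn provide this and smarter variants (e.g. SMOTE) in production-ready form.

```python
import numpy as np

def random_oversample(X, y, rng=None):
    """Duplicate rows of the smaller classes (sampling with replacement)
    until every class matches the majority-class count."""
    rng = rng or np.random.default_rng()
    classes, counts = np.unique(y, return_counts=True)
    parts = []
    for c in classes:
        rows = np.flatnonzero(y == c)
        extra = rng.choice(rows, size=counts.max() - len(rows), replace=True)
        parts.append(np.concatenate([rows, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)        # 8:2 imbalance
_, yb = random_oversample(X, y)
print(np.bincount(yb))                  # -> [8 8]
```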
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type-level distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of prediction-based models is more apparent than real, and certainly not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
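A minimal sketch of the RSA step, assuming cosine distances and Spearman correlation (common choices, though the paper's exact protocol may differ): two spaces agree to the extent that they impose the same pairwise distance structure on the same words.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa(space_a, space_b):
    """Correlate the pairwise cosine-distance structure that two
    semantic spaces assign to the same vocabulary."""
    return spearmanr(pdist(space_a, "cosine"), pdist(space_b, "cosine")).correlation

rng = np.random.default_rng(0)
static_vecs = rng.normal(size=(50, 300))                 # e.g. word2vec-style
bert_like = static_vecs @ rng.normal(size=(300, 768))    # stand-in for averaged BERT vectors
print(f"RSA correlation: {rsa(static_vecs, bert_like):.2f}")
```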
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
- Comparing hundreds of machine learning classifiers and discrete choice models in predicting travel behavior: an empirical benchmark [6.815730801645785]
Many studies have compared machine learning (ML) and discrete choice models (DCMs) in predicting travel demand. These studies often lack generalizability as they compare models deterministically without considering contextual variations. This benchmark study compares the two approaches on two large-scale data sources.
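A minimal sketch of one such pairing on synthetic data: the multinomial logit (the workhorse DCM) cast as multinomial logistic regression against a random forest; the benchmark itself evaluates hundreds of models on real travel-survey data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Four "travel mode" alternatives; synthetic stand-in for survey data.
X, y = make_classification(n_samples=2000, n_classes=4, n_informative=8,
                           random_state=0)
models = {
    "MNL (multinomial logistic)": LogisticRegression(max_iter=2000),
    "Random forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```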
arXiv Detail & Related papers (2021-02-01T19:45:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.