Related papers: Unveiling Statistical Significance of Online Regression over Multiple Datasets

Unveiling Statistical Significance of Online Regression over Multiple Datasets

URL: http://arxiv.org/abs/2512.12787v1
Date: Sun, 14 Dec 2025 18:04:11 GMT
Title: Unveiling Statistical Significance of Online Regression over Multiple Datasets
Authors: Mohammad Abu-Shaira, Weishi Shi,
Abstract summary: This article examines the state-of-the-art online regression models and empirically evaluates several suitable tests.<n>For thorough evaluations, utilizing both real and synthetic datasets with 5-fold cross-validation and seed averaging ensures comprehensive assessment across various data subsets.
Score: 7.146027549101716
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite extensive focus on techniques for evaluating the performance of two learning algorithms on a single dataset, the critical challenge of developing statistical tests to compare multiple algorithms across various datasets has been largely overlooked in most machine learning research. Additionally, in the realm of Online Learning, ensuring statistical significance is essential to validate continuous learning processes, particularly for achieving rapid convergence and effectively managing concept drifts in a timely manner. Robust statistical methods are needed to assess the significance of performance differences as data evolves over time. This article examines the state-of-the-art online regression models and empirically evaluates several suitable tests. To compare multiple online regression models across various datasets, we employed the Friedman test along with corresponding post-hoc tests. For thorough evaluations, utilizing both real and synthetic datasets with 5-fold cross-validation and seed averaging ensures comprehensive assessment across various data subsets. Our tests generally confirmed the performance of competitive baselines as consistent with their individual reports. However, some statistical test results also indicate that there is still room for improvement in certain aspects of state-of-the-art methods.

Related papers

OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value [74.80873109856563]
OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data.<n>ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; and (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources.
arXiv Detail & Related papers (2025-12-16T03:33:24Z)
Ranking-Based At-Risk Student Prediction Using Federated Learning and Differential Features [4.21051987964486]
This study proposes a method that combines federated learning and differential features to address privacy concerns.<n>To evaluate the proposed method, a model for predicting at-risk students was trained using data from 1,136 students across 12 courses conducted over 4 years.<n>The trained models were also applicable for early prediction, achieving high performance in detecting at-risk students in earlier stages of the semester.
arXiv Detail & Related papers (2025-05-14T11:12:30Z)
DUAL: Diversity and Uncertainty Active Learning for Text Summarization [5.877045865753598]
We present Diversity and Uncertainty Active Learning (DUAL), a novel algorithm that combines uncertainty and diversity to annotate samples.<n>We demonstrate thatUAL consistently matches or outperforms the best performing strategies in text summarization.
arXiv Detail & Related papers (2025-03-02T12:06:16Z)
Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric [48.81957145701228]
We propose NovelSum, a new diversity metric based on sample-level "novelty"<n> Experiments on both simulated and real-world data show that NovelSum accurately captures diversity variations and achieves a 0.97 correlation with instruction-tuned model performance.
arXiv Detail & Related papers (2025-02-24T14:20:22Z)
The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes [30.30769701138665]
We introduce and explore the Mirrored Influence Hypothesis, highlighting a reciprocal nature of influence between training and test data. Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse problem. We introduce a new method for estimating the influence of training data, which requires calculating gradients for specific test samples, paired with a forward pass for each training point.
arXiv Detail & Related papers (2024-02-14T03:43:05Z)
Self-aware and Cross-sample Prototypical Learning for Semi-supervised Medical Image Segmentation [10.18427897663732]
Consistency learning plays a crucial role in semi-supervised medical image segmentation. It enables the effective utilization of limited annotated data while leveraging the abundance of unannotated data. We propose a self-aware and cross-sample prototypical learning method ( SCP-Net) to enhance the diversity of prediction in consistency learning.
arXiv Detail & Related papers (2023-05-25T16:22:04Z)
A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts [117.72709110877939]
Test-time adaptation (TTA) has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions.<n>We categorize TTA into several distinct groups based on the form of test data, namely, test-time domain adaptation, test-time batch adaptation, and online test-time adaptation.
arXiv Detail & Related papers (2023-03-27T16:32:21Z)
Revisiting Long-tailed Image Classification: Survey and Benchmarks with New Evaluation Metrics [88.39382177059747]
A corpus of metrics is designed for measuring the accuracy, robustness, and bounds of algorithms for learning with long-tailed distribution. Based on our benchmarks, we re-evaluate the performance of existing methods on CIFAR10 and CIFAR100 datasets.
arXiv Detail & Related papers (2023-02-03T02:40:54Z)
A classification performance evaluation measure considering data separability [6.751026374812737]
We propose a new separability measure--the rate of separability (RS)--based on the data coding rate. We demonstrate the positive correlation between the proposed measure and recognition accuracy in a multi-task scenario constructed from a real dataset.
arXiv Detail & Related papers (2022-11-10T09:18:26Z)
Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem. We examine the performance of various debiasing methods across multiple tasks. We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification. We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned. Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.