Rare Event Detection in Imbalanced Multi-Class Datasets Using an Optimal MIP-Based Ensemble Weighting Approach
- URL: http://arxiv.org/abs/2412.13439v3
- Date: Fri, 31 Jan 2025 14:05:16 GMT
- Title: Rare Event Detection in Imbalanced Multi-Class Datasets Using an Optimal MIP-Based Ensemble Weighting Approach
- Authors: Georgios Tertytchny, Georgios L. Stavrinides, Maria K. Michael
- Abstract summary: Multi-class datasets are used for rare event detection in critical cyber-physical systems. We propose an optimal, efficient, and adaptable mixed integer programming (MIP) ensemble weighting scheme. We evaluate and compare our MIP-based method against six well-established weighting schemes.
- Score: 1.2289361708127877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To address the challenges of imbalanced multi-class datasets typically used for rare event detection in critical cyber-physical systems, we propose an optimal, efficient, and adaptable mixed integer programming (MIP) ensemble weighting scheme. Our approach leverages the diverse capabilities of the classifier ensemble on a granular per class basis, while optimizing the weights of classifier-class pairs using elastic net regularization for improved robustness and generalization. Additionally, it seamlessly and optimally selects a predefined number of classifiers from a given set. We evaluate and compare our MIP-based method against six well-established weighting schemes, using representative datasets and suitable metrics, under various ensemble sizes. The experimental results reveal that MIP outperforms all existing approaches, achieving an improvement in balanced accuracy ranging from 0.99% to 7.31%, with an overall average of 4.53% across all datasets and ensemble sizes. Furthermore, it attains an overall average increase of 4.63%, 4.60%, and 4.61% in macro-averaged precision, recall, and F1-score, respectively, while maintaining computational efficiency.
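The abstract does not reproduce the optimization itself, but the idea (per classifier-class weights, an elastic-net penalty, and selection of a fixed number of classifiers) can be illustrated with a toy brute-force stand-in. This is not the paper's MIP formulation, and all accuracies below are hypothetical:

```python
from itertools import combinations

# Hypothetical per-class validation accuracies for 4 classifiers over 3 classes
# (rows: classifiers, columns: classes). All numbers are illustrative.
acc = [
    [0.90, 0.40, 0.70],
    [0.60, 0.80, 0.50],
    [0.70, 0.60, 0.85],
    [0.50, 0.75, 0.65],
]

def elastic_net(weights, l1=0.01, l2=0.01):
    """Elastic-net penalty on a flat list of classifier-class weights."""
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)

def score_subset(subset):
    """Weight each classifier-class pair by its accuracy share within the
    subset, then score the subset as the penalized mean per-class accuracy."""
    n_classes = len(acc[0])
    weights, per_class = [], []
    for c in range(n_classes):
        col = [acc[m][c] for m in subset]
        total = sum(col)
        w = [a / total for a in col]  # normalized per-class weights
        weights.extend(w)
        per_class.append(sum(wi * a for wi, a in zip(w, col)))
    return sum(per_class) / n_classes - elastic_net(weights)

k = 2  # predefined ensemble size
best = max(combinations(range(len(acc)), k), key=score_subset)
print(best, round(score_subset(best), 4))
```

A real MIP would optimize the classifier-class weights jointly with the binary selection variables via a solver, rather than normalizing by accuracy share and enumerating subsets as this sketch does.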
Related papers
- Subset Selection for Fine-Tuning: A Utility-Diversity Balanced Approach for Mathematical Domain Adaptation [0.0]
We propose a refined approach to efficiently fine-tune large language models (LLMs) on specific domains such as mathematics. Our approach combines utility and diversity metrics to select the most informative and representative training examples.
arXiv Detail & Related papers (2025-05-02T18:20:44Z) - MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space [12.583633720004118]
Data quality and diversity are key to the construction of effective instruction-tuning datasets.
We introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space.
arXiv Detail & Related papers (2025-04-18T17:59:46Z) - Aioli: A Unified Optimization Framework for Language Model Data Mixing [74.50480703834508]
We show that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity per group.
We derive a new online method named Aioli, which directly estimates the mixing law parameters throughout training and uses them to dynamically adjust proportions.
arXiv Detail & Related papers (2024-11-08T17:50:24Z) - Ensemble Methods for Sequence Classification with Hidden Markov Models [8.241486511994202]
We present a lightweight approach to sequence classification using Ensemble Methods for Hidden Markov Models (HMMs).
HMMs offer significant advantages in scenarios with imbalanced or smaller datasets due to their simplicity, interpretability, and efficiency.
Our ensemble-based scoring method enables the comparison of sequences of any length and improves performance on imbalanced datasets.
arXiv Detail & Related papers (2024-09-11T20:59:32Z) - AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs [61.13296177652599]
This paper demonstrates that the optimal composition of training data from different domains is scale-dependent. We introduce *AutoScale*, a novel, practical approach for optimizing data compositions at potentially large training data scales. Our evaluation on GPT-2 Large and BERT pre-training demonstrates *AutoScale*'s effectiveness in improving training convergence and downstream performance.
arXiv Detail & Related papers (2024-07-29T17:06:30Z) - Decoding-Time Language Model Alignment with Multiple Objectives [116.42095026960598]
Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives.
Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a linear combination of predictions.
We show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method.
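The core mechanism of decoding from a linear combination of per-model predictions can be sketched in a few lines. The vocabulary, log-probabilities, and weights below are all made up for illustration, and this is not the MOD paper's exact algorithm:

```python
# Toy next-token distributions from two hypothetical language models,
# each aligned to a different objective, over a tiny vocabulary.
vocab = ["yes", "no", "maybe"]
logprobs_a = {"yes": -0.2, "no": -2.0, "maybe": -3.0}  # objective A
logprobs_b = {"yes": -1.5, "no": -0.3, "maybe": -2.5}  # objective B

def decode(w):
    """Pick the next token maximizing w * logp_a + (1 - w) * logp_b."""
    return max(vocab, key=lambda t: w * logprobs_a[t] + (1 - w) * logprobs_b[t])

print(decode(0.9))  # weighting objective A heavily -> "yes"
print(decode(0.1))  # weighting objective B heavily -> "no"
```

Varying the weight `w` at decoding time trades off the two objectives without retraining either model.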
arXiv Detail & Related papers (2024-06-27T02:46:30Z) - BooleanOCT: Optimal Classification Trees based on multivariate Boolean Rules [14.788278997556606]
We introduce a new mixed-integer programming (MIP) formulation to derive the optimal classification tree.
Our methodology integrates both linear metrics, including accuracy, balanced accuracy, and cost-sensitive cost, as well as nonlinear metrics such as the F1-score.
The proposed models demonstrate practical solvability on real-world datasets, effectively handling sizes in the tens of thousands.
arXiv Detail & Related papers (2024-01-29T12:58:44Z) - A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z) - Precision-Weighted Federated Learning [1.8160945635344528]
We propose a novel algorithm that takes into account the variance of the gradients when computing the weighted average of the parameters of models trained in a Federated Learning setting.
Our method was evaluated using standard image classification datasets with two different data partitioning strategies (IID/non-IID) to measure the performance and speed of our method in resource-constrained environments.
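Inverse-variance (precision) weighting is simple to illustrate on a scalar parameter. The client values and variances below are invented for the sketch, which is a minimal stand-in for the paper's aggregation rule:

```python
# Toy precision-weighted aggregation: each client's parameter is weighted
# by the inverse of its gradient variance (its "precision"), so noisier
# clients contribute less. All numbers are illustrative.
client_params = [1.0, 1.4, 0.6]  # same scalar parameter from 3 clients
grad_vars     = [0.1, 0.4, 0.2]  # observed gradient variance per client

precisions = [1.0 / v for v in grad_vars]
total = sum(precisions)
aggregated = sum(p * w / total for p, w in zip(client_params, precisions))
print(round(aggregated, 4))  # -> 0.9429
```

Compared with plain FedAvg (which would give 1.0 here), the aggregate is pulled toward the low-variance client's value.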
arXiv Detail & Related papers (2021-07-20T17:17:10Z) - Hybrid Ensemble optimized algorithm based on Genetic Programming for imbalanced data classification [0.0]
We propose a hybrid ensemble algorithm based on Genetic Programming (GP) for binary imbalanced data classification.
Experimental results on the specified datasets show that, depending on the training set size, the proposed method achieves 40% to 50% better accuracy in minority-class prediction than competing approaches.
arXiv Detail & Related papers (2021-06-02T14:14:38Z) - PLM: Partial Label Masking for Imbalanced Multi-label Classification [59.68444804243782]
Neural networks trained on real-world datasets with long-tailed label distributions are biased towards frequent classes and perform poorly on infrequent classes.
We propose a method, Partial Label Masking (PLM), which utilizes this ratio during training.
Our method achieves strong performance when compared to existing methods on both multi-label (MultiMNIST and MSCOCO) and single-label (imbalanced CIFAR-10 and CIFAR-100) image classification datasets.
arXiv Detail & Related papers (2021-05-22T18:07:56Z) - Data Dependent Randomized Smoothing [127.34833801660233]
We show that our data dependent framework can be seamlessly incorporated into 3 randomized smoothing approaches.
We get 9% and 6% improvement over the certified accuracy of the strongest baseline for a radius of 0.5 on CIFAR10 and ImageNet.
arXiv Detail & Related papers (2020-12-08T10:53:11Z) - Two-Step Meta-Learning for Time-Series Forecasting Ensemble [1.1278903078792915]
Forecasting using an ensemble of several methods is often seen as a compromise.
We propose to predict these aspects adaptively using meta-learning.
The proposed approach was tested on 12561 micro-economic time-series.
arXiv Detail & Related papers (2020-11-20T18:35:02Z) - Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.