Mathematical Foundations of Poisoning Attacks on Linear Regression over Cumulative Distribution Functions
- URL: http://arxiv.org/abs/2603.00537v1
- Date: Sat, 28 Feb 2026 08:21:44 GMT
- Title: Mathematical Foundations of Poisoning Attacks on Linear Regression over Cumulative Distribution Functions
- Authors: Atsuki Sato, Martin Aumüller, Yusuke Matsui
- Abstract summary: Learned indexes are a class of index data structures that enable fast search by approximating the cumulative distribution function (CDF) using machine learning models. Recent studies have shown that learned indexes are vulnerable to poisoning attacks, where injecting a small number of poison keys into the training data can significantly degrade model accuracy and reduce index performance. We provide a rigorous theoretical analysis of poisoning attacks targeting linear regression models over CDFs.
- Score: 19.804639010699997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learned indexes are a class of index data structures that enable fast search by approximating the cumulative distribution function (CDF) using machine learning models (Kraska et al., SIGMOD'18). However, recent studies have shown that learned indexes are vulnerable to poisoning attacks, where injecting a small number of poison keys into the training data can significantly degrade model accuracy and reduce index performance (Kornaropoulos et al., SIGMOD'22). In this work, we provide a rigorous theoretical analysis of poisoning attacks targeting linear regression models over CDFs, one of the most basic regression models and a core component in many learned indexes. Our main contributions are as follows: (i) We present a theoretical proof characterizing the optimal single-point poisoning attack and show that the existing method yields the optimal attack. (ii) We show that in multi-point attacks, the existing greedy approach is not always optimal, and we rigorously derive the key properties that an optimal attack should satisfy. (iii) We propose a method to compute an upper bound of the multi-point poisoning attack's impact and empirically demonstrate that the loss under the greedy approach is often close to this bound. Our study deepens the theoretical understanding of attack strategies against linear regression models on CDFs and provides a foundation for the theoretical evaluation of attacks and defenses on learned indexes.
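To make the setting concrete, here is a minimal sketch of the attack surface, not the paper's derivation: a line is fit mapping sorted keys to their ranks (the empirical CDF), and a single poison key is chosen by brute force over an assumed candidate grid to maximize the refit model's mean-squared error. The key distribution and grid are illustrative assumptions; the paper instead characterizes the optimal single poison key analytically.

```python
# Minimal sketch of the attack surface, not the paper's algorithm: fit a line
# mapping sorted keys to ranks (the empirical CDF), then brute-force the single
# poison key that maximizes the refit model's mean-squared error.
import numpy as np

def cdf_mse(keys):
    """MSE of the least-squares line from sorted keys to their ranks."""
    keys = np.sort(keys)
    ranks = np.arange(len(keys))
    slope, intercept = np.polyfit(keys, ranks, deg=1)
    return np.mean((slope * keys + intercept - ranks) ** 2)

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1000.0, size=200)    # assumed key distribution

# Candidate grid extends beyond the clean key range (illustrative only).
candidates = np.linspace(-1000.0, 2000.0, 3001)
losses = np.array([cdf_mse(np.append(clean, c)) for c in candidates])
best = candidates[losses.argmax()]
print(f"clean MSE: {cdf_mse(clean):.3f}, "
      f"poisoned MSE: {losses.max():.3f} (poison key {best:.1f})")
```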
Related papers
- The Tail Tells All: Estimating Model-Level Membership Inference Vulnerability Without Reference Models [8.453525669833853]
We present a novel approach for estimating model-level vulnerability, the TPR at low FPR, to membership inference attacks without requiring reference models. We show that our method outperforms both low-cost (few reference models) attacks such as RMIA and other measures of distribution difference.
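For context, the metric in question, TPR at a fixed low FPR, can be computed from per-example attack scores as below. The score distributions are hypothetical stand-ins; the paper's contribution is estimating this quantity without reference models, which is not shown here.

```python
# Computing TPR at a fixed low FPR from membership-inference attack scores.
# The score distributions below are assumed for illustration only.
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.001):
    # Threshold set so at most target_fpr of non-members score above it.
    thresh = np.quantile(nonmember_scores, 1.0 - target_fpr)
    return float(np.mean(member_scores > thresh))

rng = np.random.default_rng(1)
members = rng.normal(0.5, 1.0, 100_000)       # hypothetical attack scores
nonmembers = rng.normal(0.0, 1.0, 100_000)
print(tpr_at_fpr(members, nonmembers, target_fpr=0.001))
```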
arXiv Detail & Related papers (2025-10-22T17:03:55Z)
- Explainer-guided Targeted Adversarial Attacks against Binary Code Similarity Detection Models [12.524811181751577]
We propose a novel optimization for adversarial attacks against BCSD models. In particular, we aim to improve the attacks in a challenging scenario, where the attack goal is to limit the model predictions to a specific range. Our attack leverages the superior capability of black-box, model-agnostic explainers in interpreting the model decision boundaries.
arXiv Detail & Related papers (2025-06-05T08:29:19Z)
- Adversarial Training for Defense Against Label Poisoning Attacks [53.893792844055106]
Label poisoning attacks pose significant risks to machine learning models. We propose a novel adversarial training defense strategy based on support vector machines (SVMs) to counter these threats. Our approach accommodates various model architectures and employs a projected gradient descent algorithm with kernel SVMs for adversarial training.
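As a rough illustration of the training loop (a linear hinge-loss stand-in, not the paper's kernel-SVM formulation), each example is perturbed by a few projected gradient steps before the weight update. Epsilon, step sizes, and the regularizer are assumptions.

```python
# Rough sketch of adversarial training with a hinge loss and projected gradient
# descent; a linear stand-in for the paper's kernel-SVM formulation.
import numpy as np

def hinge_active(w, x, y):
    return y * (w @ x) < 1.0                        # margin violated?

def pgd_perturb(w, x, y, eps=0.3, step=0.1, iters=5):
    """Maximize hinge loss over an L-inf ball of radius eps around x."""
    x_adv = x.copy()
    for _ in range(iters):
        if hinge_active(w, x_adv, y):
            x_adv = x_adv + step * np.sign(-y * w)  # ascent direction
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project back into the ball
    return x_adv

def adversarial_train(X, y, lr=0.01, lam=0.01, epochs=20):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, yi in zip(X, y):
            x_adv = pgd_perturb(w, x, yi)
            grad = (-yi * x_adv if hinge_active(w, x_adv, yi) else 0.0) + lam * w
            w -= lr * grad
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2)); y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
w = adversarial_train(X, y)
print("train accuracy:", np.mean(np.sign(X @ w) == y))
```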
arXiv Detail & Related papers (2025-02-24T13:03:19Z)
- Generating Poisoning Attacks against Ridge Regression Models with Categorical Features [0.0]
Machine Learning (ML) models have become a very powerful tool to extract information from large datasets. ML models can be vulnerable to external attacks, causing them to underperform or deviate from their expected tasks. In this paper, we propose a method to generate strong attacks against a ridge regression model containing both numerical and categorical features, explicitly poisoning the categorical features.
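As a generic illustration of the bilevel structure of such attacks (numerical features only; the paper's explicit categorical-feature poisoning is not reproduced here), a poison point can be pushed uphill on the attacker's validation loss by numerical gradient ascent, refitting ridge regression in the inner step:

```python
# Generic poisoning skeleton against ridge regression. Inner step: closed-form
# ridge refit with the poison point; outer step: numerical gradient ascent on
# the attacker's validation MSE. Data, box bounds, and step size are assumed.
import numpy as np

def ridge_fit(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def attacker_loss(xp, yp, X, y, Xv, yv, lam=1.0):
    w = ridge_fit(np.vstack([X, xp]), np.append(y, yp), lam)
    return np.mean((Xv @ w - yv) ** 2)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
Xv, yv = X[:30], y[:30]                        # attacker's validation split

xp, yp, h = rng.normal(size=3), 5.0, 1e-4      # assumed starting poison point
for _ in range(100):
    base = attacker_loss(xp, yp, X, y, Xv, yv)
    grad = np.array([(attacker_loss(xp + h * np.eye(3)[j], yp, X, y, Xv, yv)
                      - base) / h for j in range(3)])
    xp = np.clip(xp + 0.5 * grad, -5.0, 5.0)   # project into a feasible box
print("attacker's validation MSE:", attacker_loss(xp, yp, X, y, Xv, yv))
```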
arXiv Detail & Related papers (2025-01-13T12:40:52Z)
- Structured Radial Basis Function Network: Modelling Diversity for Multiple Hypotheses Prediction [51.82628081279621]
Multi-modal regression is important for forecasting nonstationary processes or complex mixtures of distributions.
A Structured Radial Basis Function Network is presented as an ensemble of multiple-hypothesis predictors for regression problems.
It is proved that this structured model can efficiently interpolate the induced tessellation and approximate the multi-hypothesis target distribution.
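For orientation, a plain (unstructured) RBF-network regressor looks like the following; the paper's structured multi-hypothesis variant is not reproduced here, and the basis placement and bandwidth are assumptions.

```python
# Plain RBF-network regression for orientation only; the paper's *structured*
# multi-hypothesis variant is not reproduced here.
import numpy as np

def rbf_features(x, centers, gamma=10.0):
    return np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, 200)
y = np.sin(3.0 * x) + rng.normal(scale=0.1, size=200)

centers = np.linspace(-1.0, 1.0, 15)           # assumed basis placement
Phi = rbf_features(x, centers)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # linear readout
print("train MSE:", np.mean((Phi @ w - y) ** 2))
```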
arXiv Detail & Related papers (2023-09-02T01:27:53Z)
- Doubly Robust Instance-Reweighted Adversarial Training [107.40683655362285]
We propose a novel doubly-robust instance reweighted adversarial framework.
Our importance weights are obtained by optimizing the KL-divergence regularized loss function.
Our proposed approach outperforms related state-of-the-art baseline methods in terms of average robust performance.
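The KL-regularized reweighting mentioned above has a well-known closed form: the worst-case instance weights reduce to a softmax over per-example losses. A minimal sketch, with an assumed temperature:

```python
# Minimal sketch of KL-regularized instance reweighting: with a KL penalty,
# the worst-case weights are a softmax over per-example losses (the
# temperature tau is an assumed hyperparameter, not the paper's value).
import numpy as np

def kl_instance_weights(losses, tau=1.0):
    z = (losses - losses.max()) / tau   # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

per_example_losses = np.array([0.2, 1.5, 0.9, 3.0])
print(kl_instance_weights(per_example_losses, tau=0.5))  # hard examples weigh more
```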
arXiv Detail & Related papers (2023-08-01T06:16:18Z)
- On Practical Aspects of Aggregation Defenses against Data Poisoning Attacks [58.718697580177356]
Attacks on deep learning models with malicious training samples are known as data poisoning.
Recent advances in defense strategies against data poisoning have highlighted the effectiveness of aggregation schemes in achieving certified poisoning robustness.
Here we focus on Deep Partition Aggregation, a representative aggregation defense, and assess its practical aspects, including efficiency, performance, and robustness.
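The core mechanism of partition-based aggregation can be sketched as follows (a simplified stand-in with a toy base learner, not the evaluated DPA implementation): each training sample is deterministically assigned to one partition, one base model is trained per partition, and predictions are aggregated by majority vote, so each poisoned sample can corrupt at most one voter.

```python
# Simplified partition-aggregation sketch (toy base learner; not the evaluated
# DPA implementation): deterministic partitioning means one poisoned training
# sample can corrupt at most one of the k voters.
import numpy as np

class NearestCentroid:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.centroids = np.array([X[y == c].mean(axis=0) for c in self.classes])
        return self
    def predict(self, x):
        return self.classes[np.argmin(((self.centroids - x) ** 2).sum(axis=1))]

def train_partitions(X, y, k=5):
    # Assign sample i to partition i % k (a stand-in for hashing sample ids).
    part = np.arange(len(X)) % k
    return [NearestCentroid().fit(X[part == p], y[part == p]) for p in range(k)]

def majority_vote(models, x):
    preds = [m.predict(x) for m in models]
    vals, counts = np.unique(preds, return_counts=True)
    return vals[counts.argmax()]

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 2)); y = (X[:, 0] > 0).astype(int)
models = train_partitions(X, y, k=5)
print(majority_vote(models, np.array([1.0, 0.3])))
```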
arXiv Detail & Related papers (2023-06-28T17:59:35Z)
- On the Robustness of Random Forest Against Untargeted Data Poisoning: An Ensemble-Based Approach [42.81632484264218]
In machine learning, perturbing a fraction of the training set (poisoning) can seriously undermine model accuracy.
This paper implements a novel hash-based ensemble approach that protects random forests against untargeted, random poisoning attacks.
arXiv Detail & Related papers (2022-09-28T11:41:38Z)
- Testing the Robustness of Learned Index Structures [15.472214703318805]
This work evaluates the robustness of learned index structures in the presence of adversarial workloads.
To simulate adversarial workloads, we carry out a data poisoning attack on linear regression models.
We show that learned index structures can suffer from a significant performance deterioration of up to 20% when evaluated on poisoned vs. non-poisoned datasets.
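This performance deterioration has a simple mechanical explanation: the model's maximum position error bounds the local search window for every lookup, so inflating the max error makes every query slower. A sketch of that measurement, under an assumed setup with clustered out-of-range poison keys:

```python
# Sketch of how poisoning translates into lookup slowdown: the linear model's
# maximum position error bounds the local search window per query. The setup
# and poison keys are assumptions for illustration.
import numpy as np

def max_position_error(keys):
    keys = np.sort(keys)
    ranks = np.arange(len(keys))
    slope, intercept = np.polyfit(keys, ranks, deg=1)
    return int(np.ceil(np.abs(slope * keys + intercept - ranks).max()))

rng = np.random.default_rng(6)
clean = rng.uniform(0.0, 1000.0, size=1000)
poison = np.full(20, 10_000.0)                 # keys far outside the clean range
print("clean window:   ", max_position_error(clean))
print("poisoned window:", max_position_error(np.append(clean, poison)))
```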
arXiv Detail & Related papers (2022-07-23T18:44:54Z)
- How Robust are Randomized Smoothing based Defenses to Data Poisoning? [66.80663779176979]
We present a previously unrecognized threat to robust machine learning models that highlights the importance of training-data quality.
We propose a novel bilevel optimization-based data poisoning attack that degrades the robustness guarantees of certifiably robust classifiers.
Our attack is effective even when the victim trains the models from scratch using state-of-the-art robust training methods.
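For background on what the attack degrades: a randomized-smoothing certificate (in the standard Cohen et al. style) derives a robust radius from the smoothed classifier's top-class probability. A simplified sketch with a toy classifier; a real certificate replaces the empirical probability with a confidence lower bound.

```python
# Simplified randomized-smoothing certificate (Cohen et al.-style), the
# guarantee the poisoning attack degrades. Toy base classifier; a real
# certificate uses a confidence lower bound on the top-class probability.
import numpy as np
from statistics import NormalDist

def certify(base_classifier, x, sigma=0.5, n=10_000, rng=None):
    rng = rng or np.random.default_rng(7)
    noisy = x + rng.normal(scale=sigma, size=(n,) + x.shape)
    preds = np.array([base_classifier(z) for z in noisy])
    vals, counts = np.unique(preds, return_counts=True)
    p_top = counts.max() / n
    if p_top <= 0.5:
        return None, 0.0                        # abstain: no certificate
    return vals[counts.argmax()], sigma * NormalDist().inv_cdf(p_top)

toy_clf = lambda z: int(z[0] > 0.0)             # stand-in base classifier
print(certify(toy_clf, np.array([0.8])))
```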
arXiv Detail & Related papers (2020-12-02T15:30:21Z)
- The Price of Tailoring the Index to Your Data: Poisoning Attacks on Learned Index Structures [9.567119607658299]
We present the first study of poisoning attacks on learned index structures.
We formulate the first poisoning attacks on linear regression models trained on a cumulative distribution function.
We generalize our poisoning techniques to attack a more advanced two-stage design of learned index structures.
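The two-stage design referred to above roughly follows the recursive-model-index pattern: a root model routes each key to one of several second-stage models, each fit on its own key range. A schematic sketch; the model count and routing rule are assumptions, not the attacked system's exact design.

```python
# Schematic two-stage learned index (RMI-style); model count and routing rule
# are illustrative assumptions.
import numpy as np

class TwoStageIndex:
    def __init__(self, keys, n_leaves=8):
        self.keys = np.sort(keys)
        self.n, self.n_leaves = len(self.keys), n_leaves
        ranks = np.arange(self.n)
        self.root = np.polyfit(self.keys, ranks, deg=1)    # stage 1
        buckets = self._route(self.keys)
        self.leaves = []
        for b in range(n_leaves):                          # stage 2: per-bucket lines
            mask = buckets == b
            self.leaves.append(np.polyfit(self.keys[mask], ranks[mask], deg=1)
                               if mask.sum() >= 2 else self.root)

    def _route(self, k):
        pos = np.polyval(self.root, k)
        return np.clip(pos * self.n_leaves / self.n, 0, self.n_leaves - 1).astype(int)

    def lookup(self, key):
        leaf = self.leaves[int(self._route(np.array([key]))[0])]
        return int(np.clip(np.polyval(leaf, key), 0, self.n - 1))
        # predicted rank; a real index then searches locally around it

rng = np.random.default_rng(8)
idx = TwoStageIndex(rng.uniform(0.0, 1000.0, size=5000))
print(idx.lookup(123.4))
```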
arXiv Detail & Related papers (2020-08-01T17:12:04Z)
- Adversarial Distributional Training for Robust Deep Learning [53.300984501078126]
Adversarial training (AT) is among the most effective techniques to improve model robustness by augmenting training data with adversarial examples.
Most existing AT methods adopt a specific attack to craft adversarial examples, leading to unreliable robustness against other unseen attacks.
In this paper, we introduce adversarial distributional training (ADT), a novel framework for learning robust models.
arXiv Detail & Related papers (2020-02-14T12:36:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.