ANML: Attribution-Native Machine Learning with Guaranteed Robustness
- URL: http://arxiv.org/abs/2602.11690v1
- Date: Thu, 12 Feb 2026 08:12:30 GMT
- Title: ANML: Attribution-Native Machine Learning with Guaranteed Robustness
- Authors: Oliver Zahn, Matt Beton, Simran Chana,
- Abstract summary: We introduce ANML, a framework that weights training samples by four quality factors. ANML achieves 33-72% error reduction over gradient-only baselines. Contributor-level attribution provides 1.3-5.3x greater improvement than sample-level methods.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Frontier AI systems increasingly train on specialized expert data, from clinical records to proprietary research to curated datasets, yet current training pipelines treat all samples identically. A Nobel laureate's contribution receives the same weight as an unverified submission. We introduce ANML (Attribution-Native Machine Learning), a framework that weights training samples by four quality factors: gradient-based consistency (q), verification status (v), contributor reputation (r), and temporal relevance (T). By combining what the model observes (gradient signals) with what the system knows about data provenance (external signals), ANML produces per-contributor quality weights that simultaneously improve model performance and enable downstream attribution. Across 5 datasets (178-32,561 samples), ANML achieves 33-72% error reduction over gradient-only baselines. Quality-weighted training is data-efficient: 20% high-quality data outperforms 100% uniformly weighted data by 47%. A Two-Stage Adaptive gating mechanism guarantees that ANML never underperforms the best available baseline, including under strategic joint attacks combining credential faking with gradient alignment. When per-sample detection fails against subtle corruption, contributor-level attribution provides 1.3-5.3x greater improvement than sample-level methods, with the advantage growing as corruption becomes harder to detect.
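The abstract names four per-sample quality factors (q, v, r, T) that are combined into per-contributor training weights. A minimal sketch of that idea is below, assuming a simple multiplicative combination with normalization; the paper's actual combination rule, factor ranges, and gating mechanism are not specified here, so `anml_weights` and `weighted_loss` are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def anml_weights(q, v, r, T):
    """Combine per-sample quality factors into training weights.

    q: gradient-based consistency in [0, 1]
    v: verification status (0/1, or a soft score)
    r: contributor reputation in [0, 1]
    T: temporal relevance in [0, 1]

    Assumption: factors combine multiplicatively and weights are
    normalized to sum to 1.
    """
    raw = np.asarray(q) * np.asarray(v) * np.asarray(r) * np.asarray(T)
    return raw / raw.sum()

def weighted_loss(per_sample_losses, weights):
    """Quality-weighted training objective: sum of w_i * loss_i."""
    return float(np.dot(weights, per_sample_losses))

# Three samples: a verified expert contribution, an unverified
# low-reputation submission, and an older but solid sample.
q = [0.9, 0.4, 0.8]
v = [1.0, 0.5, 1.0]
r = [0.95, 0.3, 0.7]
T = [1.0, 1.0, 0.6]
w = anml_weights(q, v, r, T)
loss = weighted_loss([0.2, 1.5, 0.4], w)
```

Under this sketch, the unverified low-reputation sample receives a much smaller weight, so its high loss contributes little to the training objective.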
Related papers
- Self-Training the Neurochaos Learning Algorithm [0.0]
This study introduces a hybrid semi-supervised learning architecture that integrates Neurochaos Learning (NL) with a threshold-based Self-Training (ST) method to overcome this constraint. The proposed Self-Training Neurochaos Learning (NL+ST) architecture consistently attains superior performance relative to standalone ST models.
arXiv Detail & Related papers (2026-01-03T10:24:01Z) - TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning [33.47825979936341]
Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs). We propose an effective policy optimization algorithm, TraPO, that identifies reliable unlabeled samples by matching their learning-trajectory similarity to labeled ones. With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%).
arXiv Detail & Related papers (2025-12-15T09:03:45Z) - Signal Fidelity Index-Aware Calibration for Dementia Predictions Across Heterogeneous Real-World Data [1.741250583668341]
We develop a Signal Fidelity Index (SFI) to diagnose data quality at the patient level in dementia. We test SFI-aware calibration for improving model performance across heterogeneous datasets without outcome labels.
arXiv Detail & Related papers (2025-09-10T15:19:04Z) - Uncertainty-aware Long-tailed Weights Model the Utility of Pseudo-labels for Semi-supervised Learning [50.868594148443215]
We propose an Uncertainty-aware Ensemble Structure (UES) to assess the utility of pseudo-labels for unlabeled samples. UES is lightweight and architecture-agnostic, easily extending to various computer vision tasks, including classification and regression.
arXiv Detail & Related papers (2025-03-13T02:21:04Z) - R+R: Security Vulnerability Dataset Quality Is Critical [0.6906005491572401]
A number of studies have employed datasets that are plagued by high duplication rates, questionable label accuracy, and incomplete samples. Our findings indicate that 56% of the samples had incorrect labels and 44% were incomplete; only 31% were both accurate and complete. We employ transfer learning using a large deduplicated bugfix corpus to show that these models can perform better when given larger amounts of high-quality pre-training data.
arXiv Detail & Related papers (2025-03-09T01:49:30Z) - Which Augmentation Should I Use? An Empirical Investigation of Augmentations for Self-Supervised Phonocardiogram Representation Learning [5.438725298163702]
Self-supervised contrastive learning (SSL) has shown promise in mitigating the issue of data scarcity. Our research explores and evaluates a wide range of audio-based augmentations and uncovers combinations that enhance SSL model performance in PCG classification.
arXiv Detail & Related papers (2023-12-01T11:06:00Z) - The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of a challenging problem in healthcare.
Within this framework, we train 15 predictive models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z) - Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z) - Hierarchical Semi-Supervised Contrastive Learning for Contamination-Resistant Anomaly Detection [81.07346419422605]
Anomaly detection aims at identifying deviant samples from the normal data distribution.
Contrastive learning has provided a successful way to learn sample representations that enable effective discrimination of anomalies.
We propose a novel hierarchical semi-supervised contrastive learning framework for contamination-resistant anomaly detection.
arXiv Detail & Related papers (2022-07-24T18:49:26Z) - Boosting Facial Expression Recognition by A Semi-Supervised Progressive
Teacher [54.50747989860957]
We propose a semi-supervised learning algorithm named Progressive Teacher (PT) to utilize reliable FER datasets as well as large-scale unlabeled expression images for effective training.
Experiments on widely-used databases RAF-DB and FERPlus validate the effectiveness of our method, which achieves state-of-the-art performance with accuracy of 89.57% on RAF-DB.
arXiv Detail & Related papers (2022-05-28T07:47:53Z) - Open-Set Semi-Supervised Learning for 3D Point Cloud Understanding [62.17020485045456]
It is commonly assumed in semi-supervised learning (SSL) that the unlabeled data are drawn from the same distribution as that of the labeled ones.
We propose to selectively utilize unlabeled data through sample weighting, so that only conducive unlabeled data would be prioritized.
arXiv Detail & Related papers (2022-05-02T16:09:17Z) - Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection [10.851348154870852]
We argue that, for anti-spoofing, indistinguishable samples deserve more attention than easily-classified ones in the modeling process.
We propose to leverage a balanced focal loss function as the training objective to dynamically scale the loss based on the traits of the sample itself.
With complementary features, our fusion system with only three kinds of features outperforms other systems by 22.5% for min-tDCF and 7% for EER.
arXiv Detail & Related papers (2020-06-25T17:06:47Z)
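The entry above describes scaling the loss by the traits of each sample via a balanced focal loss. A minimal sketch of the standard formulation FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) is below; the paper's exact balancing scheme and hyperparameters may differ, so this is an illustration of the general technique, not the authors' implementation.

```python
import math

def balanced_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary balanced focal loss.

    p: predicted probability of the positive class
    y: ground-truth label in {0, 1}
    alpha: class-balancing weight for the positive class (assumed value)
    gamma: focusing parameter (assumed value)

    The (1 - p_t)^gamma factor keeps the loss large for hard,
    borderline samples and down-weights easy, confident ones.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = balanced_focal_loss(0.95, 1)  # confident and correct: tiny loss
hard = balanced_focal_loss(0.55, 1)  # borderline: much larger loss
```

Setting gamma to 0 recovers a plain class-weighted cross-entropy, so gamma directly controls how strongly indistinguishable samples dominate the training objective.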
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.