What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift
- URL: http://arxiv.org/abs/2504.21042v1
- Date: Mon, 28 Apr 2025 13:30:48 GMT
- Title: What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift
- Authors: Jiamin Chang, Haoyang Li, Hammond Pearce, Ruoxi Sun, Bo Li, Minhui Xue
- Abstract summary: ConceptLens is a generic framework that leverages pre-trained multimodal models to identify integrity threats. It uncovers vulnerabilities to bias injection, such as the generation of covert advertisements through malicious concept shifts. It also uncovers sociological biases in generative content, revealing disparities across sociological contexts.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing adoption of artificial intelligence (AI) has amplified concerns about trustworthiness, including integrity, privacy, robustness, and bias. To assess and attribute these threats, we propose ConceptLens, a generic framework that leverages pre-trained multimodal models to identify the root causes of integrity threats by analyzing Concept Shift in probing samples. ConceptLens demonstrates strong detection performance for vanilla data poisoning attacks and uncovers vulnerabilities to bias injection, such as the generation of covert advertisements through malicious concept shifts. It identifies privacy risks in unaltered but high-risk samples, filters them before training, and provides insights into model weaknesses arising from incomplete or imbalanced training data. Additionally, at the model level, it attributes concepts that the target model is overly dependent on, identifies misleading concepts, and explains how disrupting key concepts negatively impacts the model. Furthermore, it uncovers sociological biases in generative content, revealing disparities across sociological contexts. Strikingly, ConceptLens reveals how safe training and inference data can be unintentionally and easily exploited, potentially undermining safety alignment. Our study informs actionable insights to breed trust in AI systems, thereby speeding adoption and driving greater innovation.
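The abstract describes probing samples with a pre-trained multimodal model and flagging integrity threats by measuring Concept Shift. The mechanics can be sketched minimally as follows; note this is an illustration only, not ConceptLens's actual method: the embeddings are random vectors standing in for a real multimodal encoder (e.g. a CLIP-style model), and the concept count, dimensionality, and 0.8 injection strength are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def concept_scores(sample_embs, concept_embs):
    """Cosine similarity of each sample embedding to each concept embedding."""
    s = sample_embs / np.linalg.norm(sample_embs, axis=1, keepdims=True)
    c = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    return s @ c.T  # shape: (n_samples, n_concepts)

def concept_shift(reference_embs, probe_embs, concept_embs):
    """Mean per-concept score difference between probe and reference sets.

    A large shift on one concept suggests the probe samples have been
    pushed toward (or away from) that concept, e.g. by data poisoning."""
    ref = concept_scores(reference_embs, concept_embs).mean(axis=0)
    probe = concept_scores(probe_embs, concept_embs).mean(axis=0)
    return probe - ref

concepts = rng.normal(size=(5, 64))      # 5 hypothetical concept directions
reference = rng.normal(size=(100, 64))   # clean probing samples
# Simulated poisoned samples: nudged toward concept index 2
poisoned = rng.normal(size=(100, 64)) + 0.8 * concepts[2]

shift = concept_shift(reference, poisoned, concepts)
flagged = int(np.argmax(np.abs(shift)))
print(flagged)  # the injected concept (index 2) should dominate the shift
```

Because the shift is attributed per concept, this style of probe not only detects that something changed but points at *which* concept was manipulated, which is the attribution angle the paper emphasizes.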
Related papers
- On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [49.60774626839712]
Multimodal generative models have sparked critical discussions on their fairness, reliability, and potential for misuse.
We propose an evaluation framework designed to assess model reliability through their responses to perturbations in the embedding space.
Our method lays the groundwork for detecting unreliable, bias-injected models and retrieval of bias provenance.
arXiv Detail & Related papers (2024-11-21T09:46:55Z)
- New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook [54.24701201956833]
Security and privacy issues have undermined users' confidence in pre-trained models.
Current literature lacks a clear taxonomy of emerging attacks and defenses for pre-trained models.
The survey proposes a taxonomy that categorizes attacks and defenses into No-Change, Input-Change, and Model-Change approaches.
arXiv Detail & Related papers (2024-11-12T10:15:33Z)
- Black-box Adversarial Transferability: An Empirical Study in Cybersecurity Perspective [0.0]
In adversarial machine learning, malicious users try to fool a deep learning model by feeding it adversarially perturbed inputs during its training or testing phase.
We empirically test the black-box adversarial transferability phenomena in cyber attack detection systems.
The results indicate that any deep learning model is highly susceptible to adversarial attacks, even if the attacker does not have access to the internal details of the target model.
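The transferability phenomenon above can be sketched with a toy experiment; this is not the paper's setup (which targets deep cyber-attack detectors) but an illustration under invented assumptions: two independently initialized linear "models," synthetic features, and an FGSM-style perturbation crafted only on the attacker's surrogate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linearly separable data standing in for attack-detection features.
X = rng.normal(size=(400, 10))
w_true = rng.normal(size=10)
y = (X @ w_true > 0).astype(float)

def sigmoid(z):
    # Clip to avoid overflow in exp for large logits.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def train_logreg(X, y, seed, steps=300, lr=0.5):
    """Logistic regression by plain gradient descent; the seed only varies
    the initialization, mimicking independently trained models."""
    w = np.random.default_rng(seed).normal(size=X.shape[1]) * 0.01
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    return float((((X @ w) > 0).astype(float) == y).mean())

surrogate = train_logreg(X, y, seed=2)  # attacker's local model
target = train_logreg(X, y, seed=3)     # victim model, internals unseen

# FGSM crafted purely on the surrogate: step along the sign of the
# surrogate's input gradient of the logistic loss, which for a linear
# model is (p - y) * w.
eps = 1.0
p = sigmoid(X @ surrogate)
X_adv = X + eps * np.sign((p - y)[:, None] * surrogate[None, :])

clean_acc = accuracy(target, X, y)
adv_acc = accuracy(target, X_adv, y)
print(clean_acc, adv_acc)  # accuracy on the *target* collapses under attack
```

The attacker never touches the target's weights, yet the perturbations transfer because both models approximate the same decision boundary, which is the core of the black-box threat the paper studies empirically.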
arXiv Detail & Related papers (2024-04-15T06:56:28Z)
- Machine Learning Robustness: A Primer [12.426425119438846]
The discussion begins with a detailed definition of robustness, portraying it as the ability of ML models to maintain stable performance across varied and unexpected environmental conditions.
The chapter delves into the factors that impede robustness, such as data bias, model complexity, and the pitfalls of underspecified ML pipelines.
The discussion progresses to explore amelioration strategies for bolstering robustness, starting with data-centric approaches like debiasing and augmentation.
arXiv Detail & Related papers (2024-04-01T03:49:42Z)
- Learning for Counterfactual Fairness from Observational Data [62.43249746968616]
Fairness-aware machine learning aims to eliminate biases of learning models against subgroups defined by protected (sensitive) attributes such as race, gender, and age.
A prerequisite for existing methods to achieve counterfactual fairness is the prior human knowledge of the causal model for the data.
In this work, we address the problem of counterfactually fair prediction from observational data without given causal models by proposing a novel framework CLAIRE.
arXiv Detail & Related papers (2023-07-17T04:08:29Z)
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data may stem from biases in data acquisition rather than from the underlying task.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
- Understanding and Enhancing Robustness of Concept-based Models [41.20004311158688]
We study robustness of concept-based models to adversarial perturbations.
In this paper, we first propose and analyze different malicious attacks to evaluate the security vulnerability of concept-based models.
We then propose a potential general adversarial training-based defense mechanism to increase robustness of these systems to the proposed malicious attacks.
arXiv Detail & Related papers (2022-11-29T10:43:51Z)
- FRSUM: Towards Faithful Abstractive Summarization via Enhancing Factual Robustness [56.263482420177915]
We study the faithfulness of existing systems from a new perspective of factual robustness.
We propose a novel training strategy, namely FRSUM, which teaches the model to defend against both explicit adversarial samples and implicit factual adversarial perturbations.
arXiv Detail & Related papers (2022-11-01T06:09:00Z)
- Disentangled Text Representation Learning with Information-Theoretic Perspective for Adversarial Robustness [17.5771010094384]
Adversarial vulnerability remains a major obstacle to constructing reliable NLP systems.
Recent work argues that the adversarial vulnerability of models is caused by non-robust features learned in supervised training.
In this paper, we tackle the adversarial challenge from the view of disentangled representation learning.
arXiv Detail & Related papers (2022-10-26T18:14:39Z)
- Robust Transferable Feature Extractors: Learning to Defend Pre-Trained Networks Against White Box Adversaries [69.53730499849023]
We show that adversarial examples can be successfully transferred to another independently trained model to induce prediction errors.
We propose a deep learning-based pre-processing mechanism, which we refer to as a robust transferable feature extractor (RTFE).
arXiv Detail & Related papers (2022-09-14T21:09:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.