An Assessment of Model-On-Model Deception
- URL: http://arxiv.org/abs/2405.12999v1
- Date: Fri, 10 May 2024 23:24:18 GMT
- Title: An Assessment of Model-On-Model Deception
- Authors: Julius Heitkoetter, Michael Gerovitch, Laker Newhouse,
- Abstract summary: We create a dataset of over 10,000 misleading explanations by asking Llama-2 7B, 13B, 70B, and GPT-3.5 to justify the wrong answer for questions in the MMLU.
We find that, when models read these explanations, they are all significantly deceived. Worryingly, models of all capabilities are successful at misleading others, while more capable models are only slightly better at resisting deception.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The trustworthiness of highly capable language models is put at risk when they are able to produce deceptive outputs. Moreover, when models are vulnerable to deception it undermines reliability. In this paper, we introduce a method to investigate complex, model-on-model deceptive scenarios. We create a dataset of over 10,000 misleading explanations by asking Llama-2 7B, 13B, 70B, and GPT-3.5 to justify the wrong answer for questions in the MMLU. We find that, when models read these explanations, they are all significantly deceived. Worryingly, models of all capabilities are successful at misleading others, while more capable models are only slightly better at resisting deception. We recommend the development of techniques to detect and defend against deception.
Related papers
- Frontier Models are Capable of In-context Scheming [41.30527987937867]
One safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives.
We evaluate frontier models on a suite of six agentic evaluations where models are instructed to pursue goals.
We find that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities.
arXiv Detail & Related papers (2024-12-06T12:09:50Z) - Multimodal Large Language Models to Support Real-World Fact-Checking [80.41047725487645]
Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information.
While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied.
We propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking.
arXiv Detail & Related papers (2024-03-06T11:32:41Z) - Beyond Labeling Oracles: What does it mean to steal ML models? [52.63413852460003]
Model extraction attacks are designed to steal trained models with only query access.
We investigate factors influencing the success of model extraction attacks.
Our findings urge the community to redefine the adversarial goals of ME attacks.
arXiv Detail & Related papers (2023-10-03T11:10:21Z) - Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z) - MOVE: Effective and Harmless Ownership Verification via Embedded
External Features [109.19238806106426]
We propose an effective and harmless model ownership verification (MOVE) to defend against different types of model stealing simultaneously.
We conduct the ownership verification by verifying whether a suspicious model contains the knowledge of defender-specified external features.
In particular, we develop our MOVE method under both white-box and black-box settings to provide comprehensive model protection.
arXiv Detail & Related papers (2022-08-04T02:22:29Z) - Explain, Edit, and Understand: Rethinking User Study Design for
Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - Defending against Model Stealing via Verifying Embedded External
Features [90.29429679125508]
adversaries can steal' deployed models even when they have no training samples and can not get access to the model parameters or structures.
We explore the defense from another angle by verifying whether a suspicious model contains the knowledge of defender-specified emphexternal features.
Our method is effective in detecting different types of model stealing simultaneously, even if the stolen model is obtained via a multi-stage stealing process.
arXiv Detail & Related papers (2021-12-07T03:51:54Z) - TruthfulQA: Measuring How Models Mimic Human Falsehoods [2.7143159361691227]
We propose a benchmark to measure whether a language model is truthful in generating answers to questions.
The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics.
The best model was truthful on 58% of questions, while human performance was 94%.
arXiv Detail & Related papers (2021-09-08T17:15:27Z) - MEGEX: Data-Free Model Extraction Attack against Gradient-Based
Explainable AI [1.693045612956149]
Deep neural networks deployed in Machine Learning as a Service (ML) face the threat of model extraction attacks.
A model extraction attack is an attack to violate intellectual property and privacy in which an adversary steals trained models in a cloud using only their predictions.
In this paper, we propose MEGEX, a data-free model extraction attack against a gradient-based explainable AI.
arXiv Detail & Related papers (2021-07-19T14:25:06Z) - Feature Attributions and Counterfactual Explanations Can Be Manipulated [32.579094387004346]
We show how adversaries can design biased models that manipulate model agnostic feature attribution methods.
These vulnerabilities allow an adversary to deploy a biased model, yet explanations will not reveal this bias, thereby deceiving stakeholders into trusting the model.
We evaluate the manipulations on real world data sets, including COMPAS and Communities & Crime, and find explanations can be manipulated in practice.
arXiv Detail & Related papers (2021-06-23T17:43:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.