The Elicitation Game: Evaluating Capability Elicitation Techniques
- URL: http://arxiv.org/abs/2502.02180v1
- Date: Tue, 04 Feb 2025 09:54:24 GMT
- Title: The Elicitation Game: Evaluating Capability Elicitation Techniques
- Authors: Felix Hofstätter, Teun van der Weij, Jayden Teoh, Henning Bartsch, Francis Rhys Ward,
- Abstract summary: We evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms.
We introduce a novel method for training model organisms, based on circuit breaking.
For a code-generation task, only fine-tuning can elicit the hidden capabilities of our novel model organism.
- Score: 1.064108398661507
- License:
- Abstract: Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system's capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods for eliciting latent capabilities from models. In this paper, we evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms -- language models with hidden capabilities that are revealed by a password. We introduce a novel method for training model organisms, based on circuit breaking, which is more robust to elicitation techniques than standard password-locked models. We focus on elicitation techniques based on prompting and activation steering, and compare these to fine-tuning methods. Prompting techniques can elicit the actual capability of both password-locked and circuit-broken model organisms in an MCQA setting, while steering fails to do so. For a code-generation task, only fine-tuning can elicit the hidden capabilities of our novel model organism. Additionally, our results suggest that combining techniques improves elicitation. Still, if possible, fine-tuning should be the method of choice to improve the trustworthiness of capability evaluations.
Related papers
- Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening.
Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training.
We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z) - Can Language Models Learn to Skip Steps? [59.84848399905409]
We study the ability to skip steps in reasoning.
Unlike humans, who may skip steps to enhance efficiency or to reduce cognitive load, models do not possess such motivations.
Our work presents the first exploration into human-like step-skipping ability.
arXiv Detail & Related papers (2024-11-04T07:10:24Z) - Model Developmental Safety: A Retention-Centric Method and Applications in Vision-Language Models [75.8161094916476]
We study how to develop a pretrained vision-language model, specifically the CLIP model, for acquiring new capabilities or improving existing capabilities of image classification.
Our experiments on improving vision perception capabilities on autonomous driving and scene recognition datasets demonstrate the efficacy of the proposed approach.
arXiv Detail & Related papers (2024-10-04T22:34:58Z) - Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models? [3.258629327038072]
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language tasks.
Yet, the potential for generating harmful content through these models seems to persist.
This paper explores the concept of jailbreaking LLMs-reversing their alignment through adversarial triggers.
arXiv Detail & Related papers (2024-08-05T17:27:29Z) - AI Sandbagging: Language Models can Strategically Underperform on Evaluations [1.0485739694839669]
Trustworthy capability evaluations are crucial for ensuring the safety of AI systems.
Developers of AI systems may have incentives for evaluations to understate the AI's actual capability.
In this paper we assess sandbagging capabilities in contemporary language models.
arXiv Detail & Related papers (2024-06-11T15:26:57Z) - Stress-Testing Capability Elicitation With Password-Locked Models [6.6380867311877605]
We investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities.
We find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities.
When only evaluations, and not demonstrations, are available, approaches like reinforcement learning are still often able to elicit capabilities.
arXiv Detail & Related papers (2024-05-29T22:26:26Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Define, Evaluate, and Improve Task-Oriented Cognitive Capabilities for
Instruction Generation Models [5.975913042883176]
Recent work studies the cognitive capabilities of language models through psychological tests designed for humans.
We formulate task-oriented cognitive capabilities, which are human-like cognitive capabilities that language models leverage to perform tasks.
arXiv Detail & Related papers (2022-12-21T04:43:19Z) - Few-shot Prompting Towards Controllable Response Generation [49.479958672988566]
We first explored the combination of prompting and reinforcement learning (RL) to steer models' generation without accessing any of the models' parameters.
We apply multi-task learning to make the model learn to generalize to new tasks better.
Experiment results show that our proposed method can successfully control several state-of-the-art (SOTA) dialogue models without accessing their parameters.
arXiv Detail & Related papers (2022-06-08T14:48:06Z) - Multicriteria interpretability driven Deep Learning [0.0]
Deep Learning methods are renowned for their performances, yet their lack of interpretability prevents them from high-stakes contexts.
Recent model methods address this problem by providing post-hoc interpretability methods by reverse-engineering the model's inner workings.
We propose a Multicriteria agnostic technique that allows to control the feature effects on the model's outcome by injecting knowledge in the objective function.
arXiv Detail & Related papers (2021-11-28T09:41:13Z) - Evaluating the Safety of Deep Reinforcement Learning Models using
Semi-Formal Verification [81.32981236437395]
We present a semi-formal verification approach for decision-making tasks based on interval analysis.
Our method obtains comparable results over standard benchmarks with respect to formal verifiers.
Our approach allows to efficiently evaluate safety properties for decision-making models in practical applications.
arXiv Detail & Related papers (2020-10-19T11:18:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.