Evaluating the Prompt Steerability of Large Language Models
- URL: http://arxiv.org/abs/2411.12405v1
- Date: Tue, 19 Nov 2024 10:41:54 GMT
- Title: Evaluating the Prompt Steerability of Large Language Models
- Authors: Erik Miehling, Michael Desmond, Karthikeyan Natesan Ramamurthy, Elizabeth M. Daly, Pierre Dognin, Jesus Rios, Djallel Bouneffouf, Miao Liu
- Abstract summary: We propose a benchmark for evaluating the steerability of model personas as a function of prompting.
Our benchmark reveals that the steerability of many current models is limited due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions.
- Score: 16.341817101388454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building pluralistic AI requires designing models that are able to be shaped to represent a wide range of value systems and cultures. Achieving this requires first being able to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model's joint behavioral distribution can be shifted from its baseline behavior. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited -- due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at https://github.com/IBM/prompt-steering.
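The paper defines prompt steerability formally and releases an implementation; as a rough illustration of the idea only, the sketch below treats one persona dimension's behavior as a categorical distribution over responses and measures how far prompting shifts it from the baseline toward a target persona. The function names, the use of the Hellinger distance, and the normalization are illustrative assumptions, not the repository's actual API or formula.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two categorical distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def steerability_index(baseline, steered, target):
    """Illustrative steerability index: how far prompting moved the model's
    behavioral distribution from its baseline toward a target persona,
    normalized by the maximum possible movement (0 = no shift, 1 = fully steered).
    This mirrors the spirit of the paper's definition, not its exact formula."""
    max_shift = hellinger(baseline, target)
    if max_shift == 0:
        return 1.0  # baseline already matches the target persona
    achieved = max_shift - hellinger(steered, target)
    return max(0.0, achieved / max_shift)

# Example: distribution over {disagree, neutral, agree} for one persona dimension.
baseline = np.array([0.6, 0.3, 0.1])    # model skews toward "disagree" by default
steered  = np.array([0.3, 0.3, 0.4])    # behavior after adding persona-steering prompts
target   = np.array([0.05, 0.15, 0.8])  # fully "agree" persona
print(steerability_index(baseline, steered, target))  # partial steerability, well below 1
```

Inspecting such indices as a function of steering effort (for example, the number of persona statements in the prompt), separately for each dimension and direction, is how the benchmark surfaces the baseline skew and steering asymmetry noted in the abstract.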
Related papers
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the performance of the Llama 2 model by up to 15 percentage points.
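As a hedged illustration of pairing scored insights with a linear-programming step, the toy sketch below selects which candidate insights to surface under a reporting budget using an LP relaxation; the scores, costs, and budget are hypothetical, and this is not QualEval's actual formulation.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical setup: an LLM "reasoner" has scored candidate insights; an LP then
# chooses which insights to surface under a fixed reporting budget.
scores = np.array([0.9, 0.7, 0.4, 0.85, 0.3])   # estimated usefulness per candidate insight
costs  = np.array([2.0, 1.0, 1.0, 3.0, 1.0])    # e.g., report length per insight
budget = 5.0

# linprog minimizes, so negate the scores to maximize total usefulness.
res = linprog(c=-scores,
              A_ub=[costs], b_ub=[budget],
              bounds=[(0, 1)] * len(scores),
              method="highs")
selected = [i for i, x in enumerate(res.x) if x > 0.5]
print("selected insights:", selected)
```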
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Cross Feature Selection to Eliminate Spurious Interactions and Single Feature Dominance Explainable Boosting Machines [0.0]
Interpretability is essential for legal, ethical, and practical reasons.
High-performance models can suffer from spurious interactions with redundant features and single-feature dominance.
In this paper, we explore novel approaches to address these issues by utilizing alternate Cross-feature selection, ensemble features and model configuration alteration techniques.
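A minimal sketch of how single-feature dominance might be flagged, using scikit-learn permutation importances as a stand-in for an EBM's per-term importances; the dataset, model, and any dominance threshold are illustrative assumptions, not the paper's method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Train a boosted model on synthetic data and measure each feature's share of importance.
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=2, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

imp = permutation_importance(model, X, y, n_repeats=10, random_state=0).importances_mean
imp = np.clip(imp, 0, None)              # guard against small negative estimates
share = imp / imp.sum()
top = int(np.argmax(share))
print(f"feature {top} carries {share[top]:.0%} of total importance")
# A very large share (say, above half) would signal single-feature dominance and
# motivate the cross-feature selection and reconfiguration techniques explored above.
```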
arXiv Detail & Related papers (2023-07-17T13:47:41Z) - A Control-Centric Benchmark for Video Prediction [69.22614362800692]
We propose a benchmark for action-conditioned video prediction in the form of a control benchmark.
Our benchmark includes simulated environments with 11 task categories and 310 task instance definitions.
We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling.
arXiv Detail & Related papers (2023-04-26T17:59:45Z) - Measuring the Driving Forces of Predictive Performance: Application to
Credit Scoring [0.0]
In credit scoring, machine learning models are known to outperform standard parametric models.
We introduce the XPER methodology to decompose a performance metric into contributions associated with a model's features.
We show that a small number of features can explain a surprisingly large part of the model performance.
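XPER itself is a Shapley-value decomposition of the performance metric; the sketch below only illustrates the "attribute performance to features" idea with a crude permutation-based proxy on a synthetic dataset, and is not the XPER computation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Attribute AUC to features by the drop observed when each feature is permuted.
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
base_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    drop = base_auc - roc_auc_score(y, model.predict_proba(Xp)[:, 1])
    print(f"feature {j}: AUC drop {drop:.3f}")
# Typically a small number of features account for most of the drop,
# echoing the finding quoted above.
```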
arXiv Detail & Related papers (2022-12-12T13:09:46Z)
- Explainable Human-in-the-loop Dynamic Data-Driven Digital Twins [6.657586324950896]
Digital Twins (DT) are essentially Dynamic Data-driven models that serve as real-time symbiotic "virtual replicas" of real-world systems.
This paper presents an approach to harnessing explainability in human-in-the-loop DDDAS and DT systems, leveraging bidirectional symbiotic sensing feedback.
arXiv Detail & Related papers (2022-07-19T07:15:12Z)
- Using Shape Metrics to Describe 2D Data Points [0.0]
We propose to use shape metrics to describe 2D data to help make analyses more explainable and interpretable.
This is particularly important in applications in the medical community where the 'right to explainability' is crucial.
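As a hedged illustration, the sketch below computes two simple, human-readable descriptors of a 2D point cloud (circularity from the convex hull and elongation from the covariance eigenvalues); the specific shape metrics proposed in the paper may differ.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])  # elongated point cloud

hull = ConvexHull(points)
area, perimeter = hull.volume, hull.area          # in 2D, volume = area and area = perimeter
circularity = 4 * np.pi * area / perimeter ** 2   # 1.0 for a circle, smaller when elongated

eigvals = np.linalg.eigvalsh(np.cov(points.T))
elongation = np.sqrt(eigvals[-1] / eigvals[0])    # ratio of the principal axes

print(f"circularity={circularity:.2f}, elongation={elongation:.2f}")
```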
arXiv Detail & Related papers (2022-01-27T23:28:42Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
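A minimal sketch of the precision-recall style of diagnosis, using plain k-NN support balls; the paper's actual metrics refine this idea considerably, and the data and parameter choices here are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import NearestNeighbors

def knn_radius(X, k=5):
    # Distance from each point to its k-th nearest neighbor within X (column 0 is the point itself).
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return dist[:, -1]

def precision_recall(real, synth, k=5):
    """Illustrative fidelity/diversity diagnostic:
    precision = fraction of synthetic samples inside some real sample's k-NN ball,
    recall    = fraction of real samples inside some synthetic sample's k-NN ball."""
    d = cdist(synth, real)
    precision = np.mean((d <= knn_radius(real, k)[None, :]).any(axis=1))
    recall = np.mean((d.T <= knn_radius(synth, k)[None, :]).any(axis=1))
    return precision, recall

rng = np.random.default_rng(0)
real = rng.normal(size=(300, 2))
synth = rng.normal(loc=0.5, size=(300, 2))   # slightly shifted "generator"
print(precision_recall(real, synth))
```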
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
- To what extent do human explanations of model behavior align with actual model behavior? [91.67905128825402]
We investigated the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions.
We defined two alignment metrics that quantify how well natural language human explanations align with model sensitivity to input words.
We find that a model's alignment with human explanations is not predicted by the model's accuracy on NLI.
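As a rough sketch of the idea, the code below scores the overlap between the words a human explanation cites and the words a model is most sensitive to under occlusion; the model() function is a hypothetical stand-in, and the paper's two alignment metrics are defined differently.

```python
import re

def model(text):
    # Hypothetical keyword-based sentiment scorer standing in for a real classifier.
    positive = {"great", "love", "excellent"}
    words = text.lower().split()
    return sum(w in positive for w in words) / max(len(words), 1)

def word_sensitivities(text):
    # Occlusion: drop each word and record the absolute change in the model's score.
    words = text.split()
    base = model(text)
    return {w: abs(base - model(" ".join(words[:i] + words[i + 1:])))
            for i, w in enumerate(words)}

def alignment(text, explanation, top_k=2):
    sens = word_sensitivities(text)
    top_model_words = {w.lower() for w, _ in
                       sorted(sens.items(), key=lambda kv: -kv[1])[:top_k]}
    cited = set(re.findall(r"\w+", explanation.lower()))
    return len(top_model_words & cited) / max(len(top_model_words), 1)

text = "The plot was great but the acting was flat"
explanation = "I said positive because the review calls the plot great"
print(alignment(text, explanation))  # overlap between cited words and sensitive words
```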
arXiv Detail & Related papers (2020-12-24T17:40:06Z)
- Plausible Counterfactuals: Auditing Deep Learning Classifiers with Realistic Adversarial Examples [84.8370546614042]
The black-box nature of Deep Learning models has posed unanswered questions about what they learn from data.
A Generative Adversarial Network (GAN) and multi-objective optimization are used to furnish a plausible attack on the audited model.
Its utility is showcased within a human face classification task, unveiling the enormous potential of the proposed framework.
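A much-simplified sketch of counterfactual search follows: a scalarized objective trades off flipping a classifier's prediction against staying close to the original sample, with proximity as a crude plausibility proxy. The paper instead uses a GAN and multi-objective optimization to enforce realism; the model, data, and weighting below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
x0 = X[0]
target = 1 - int(clf.predict([x0])[0])   # the class we want the counterfactual to receive

def objective(x, lam=0.1):
    # Push the target-class probability up while penalizing distance from the original.
    p_target = clf.predict_proba([x])[0, target]
    return (1.0 - p_target) + lam * np.linalg.norm(x - x0)

cf = minimize(objective, x0, method="Nelder-Mead").x
print("original prediction:", clf.predict([x0])[0],
      "| counterfactual prediction:", clf.predict([cf])[0],
      "| distance:", round(float(np.linalg.norm(cf - x0)), 2))
```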
arXiv Detail & Related papers (2020-03-25T11:08:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.