Exploring the Robustness of Model-Graded Evaluations and Automated
Interpretability
- URL: http://arxiv.org/abs/2312.03721v2
- Date: Fri, 8 Dec 2023 11:16:39 GMT
- Title: Exploring the Robustness of Model-Graded Evaluations and Automated
Interpretability
- Authors: Simon Lermen and Ond\v{r}ej Kvapil
- Abstract summary: Evaluations relying on natural language understanding for grading can often be performed at scale by using other language models.
We test the robustness of these model-graded evaluations to injections on different datasets including a new Deception Eval.
We extrapolate that future, more intelligent models might manipulate or cooperate with their evaluation model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There has been increasing interest in evaluations of language models for a
variety of risks and characteristics. Evaluations relying on natural language
understanding for grading can often be performed at scale by using other
language models. We test the robustness of these model-graded evaluations to
injections on different datasets including a new Deception Eval. These
injections resemble direct communication between the testee and the evaluator
to change their grading. We extrapolate that future, more intelligent models
might manipulate or cooperate with their evaluation model. We find significant
susceptibility to these injections in state-of-the-art commercial models on all
examined evaluations. Furthermore, similar injections can be used on automated
interpretability frameworks to produce misleading model-written explanations.
The results inspire future work and should caution against unqualified trust in
evaluations and automated interpretability.
Related papers
- ElicitationGPT: Text Elicitation Mechanisms via Language Models [12.945581341789431]
This paper develops mechanisms for scoring elicited text against ground truth text using domain-knowledge-free queries to a large language model.
An empirical evaluation is conducted on peer reviews from a peer-grading dataset and in comparison to manual instructor scores for the peer reviews.
arXiv Detail & Related papers (2024-06-13T17:49:10Z) - PRobELM: Plausibility Ranking Evaluation for Language Models [12.057770969325453]
PRobELM is a benchmark designed to assess language models' ability to discern more plausible scenarios through their parametric knowledge.
Our benchmark is constructed from a dataset curated from Wikidata edit histories, tailored to align the temporal bounds of the training data for the evaluated models.
arXiv Detail & Related papers (2024-04-04T21:57:11Z) - Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy [27.454549324141087]
We propose a novel VQA benchmark based on well-known visual classification datasets.
We also suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category.
Our contributions aim to lay the foundation for more precise and meaningful assessments.
arXiv Detail & Related papers (2024-02-11T18:26:18Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language
Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs)
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - Are Neural Topic Models Broken? [81.15470302729638]
We study the relationship between automated and human evaluation of topic models.
We find that neural topic models fare worse in both respects compared to an established classical method.
arXiv Detail & Related papers (2022-10-28T14:38:50Z) - Explain, Edit, and Understand: Rethinking User Study Design for
Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And
Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity and overstability causing samples with high accuracies.
arXiv Detail & Related papers (2021-09-24T03:49:38Z) - Is Automated Topic Model Evaluation Broken?: The Incoherence of
Coherence [62.826466543958624]
We look at the standardization gap and the validation gap in topic model evaluation.
Recent models relying on neural components surpass classical topic models according to these metrics.
We use automatic coherence along with the two most widely accepted human judgment tasks, namely, topic rating and word intrusion.
arXiv Detail & Related papers (2021-07-05T17:58:52Z) - Learning to Compare for Better Training and Evaluation of Open Domain
Natural Language Generation Models [23.62054164511058]
We propose to evaluate natural language generation models by learning to compare a pair of generated sentences by fine-tuning BERT.
While able to be trained in a fully self-supervised fashion, our model can be further fine-tuned with a little amount of human preference annotation.
arXiv Detail & Related papers (2020-02-12T15:52:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.