Detecting Edit Failures In Large Language Models: An Improved
Specificity Benchmark
- URL: http://arxiv.org/abs/2305.17553v2
- Date: Sat, 3 Jun 2023 08:01:11 GMT
- Title: Detecting Edit Failures In Large Language Models: An Improved
Specificity Benchmark
- Authors: Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas
and Fazl Barez
- Abstract summary: We extend the existing CounterFact benchmark to include a dynamic component and dub our benchmark CounterFact+.
We use this improved benchmark to evaluate recent model editing techniques and find that they suffer from low specificity.
- Score: 9.45927470587879
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent model editing techniques promise to mitigate the problem of memorizing
false or outdated associations during LLM training. However, we show that these
techniques can introduce large unwanted side effects which are not detected by
existing specificity benchmarks. We extend the existing CounterFact benchmark
to include a dynamic component and dub our benchmark CounterFact+.
Additionally, we extend the metrics used for measuring specificity by a
principled KL divergence-based metric. We use this improved benchmark to
evaluate recent model editing techniques and find that they suffer from low
specificity. Our findings highlight the need for improved specificity
benchmarks that identify and prevent unwanted side effects.
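As a rough illustration of what a KL divergence-based specificity measure could look like, the sketch below compares a model's next-token distribution before and after an edit on a prompt that the edit should not affect. The paper's exact definition is not given in this listing, so the direction of the divergence (pre-edit relative to post-edit), the function names, and the logits are all assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid
    division by zero.
    """
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Hypothetical next-token logits over a toy 3-token vocabulary for one
# prompt that is unrelated to the edited fact (assumed values, not real data).
pre_edit_logits = [2.0, 1.0, 0.1]
post_edit_logits = [0.5, 2.5, 0.1]

p = softmax(pre_edit_logits)   # distribution of the unedited model
q = softmax(post_edit_logits)  # distribution of the edited model

# Larger values mean the edit shifted predictions more on this prompt,
# i.e. lower specificity under this assumed reading of the metric.
drift = kl_divergence(p, q)
print(f"KL(pre || post) = {drift:.4f} nats")
```

Under this reading, a perfectly specific edit would leave the distribution on unrelated prompts unchanged and yield a divergence of zero; averaging the value over many such prompts would give a single score per editing method.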
Related papers
- IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark [22.238377215355545]
We introduce IdentifyMe, a new benchmark for mention resolution presented in a multiple-choice question (MCQ) format.
We observe a significant performance gap between state-of-the-art sub-10B open models and closed ones.
The highest-scoring model, GPT-4o, achieves 81.9% accuracy, highlighting the strong referential capabilities of state-of-the-art LLMs.
arXiv Detail & Related papers (2024-11-12T01:05:55Z)
- Position: LLM Unlearning Benchmarks are Weak Measures of Progress [31.957968729934745]
We find that existing benchmarks provide an overly optimistic and potentially misleading view on the effectiveness of candidate unlearning methods.
We identify that existing benchmarks are particularly vulnerable to modifications that introduce even loose dependencies between the forget and retain information.
arXiv Detail & Related papers (2024-10-03T18:07:25Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- AUPIMO: Redefining Visual Anomaly Detection Benchmarks with High Speed and Low Tolerance [0.562479170374811]
Per-IMage Overlap (PIMO) is a novel metric that addresses the shortcomings of AUROC and AUPRO.
Measuring recall per image simplifies computation and is more robust to noisy annotations.
Our experiments demonstrate that PIMO offers practical advantages and nuanced performance insights.
arXiv Detail & Related papers (2024-01-03T21:24:44Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Studying How to Efficiently and Effectively Guide Models with Explanations [52.498055901649025]
'Model guidance' is the idea of regularizing the models' explanations to ensure that they are "right for the right reasons".
We conduct an in-depth evaluation across various loss functions, attribution methods, models, and 'guidance depths' on the PASCAL VOC 2007 and MS COCO 2014 datasets.
Specifically, we guide the models via bounding box annotations, which are much cheaper to obtain than the commonly used segmentation masks.
arXiv Detail & Related papers (2023-03-21T15:34:50Z)
- Learning Dynamic Compact Memory Embedding for Deformable Visual Object Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms the excellent segmentation-based trackers, i.e., D3S and SiamMask, on the DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z)
- A critical analysis of metrics used for measuring progress in artificial intelligence [9.387811897655016]
We analyse the current landscape of performance metrics based on data covering 3867 machine learning model performance results.
Results suggest that the large majority of metrics currently used have properties that may result in an inadequate reflection of a model's performance.
We describe ambiguities in reported metrics, which may lead to difficulties in interpreting and comparing model performances.
arXiv Detail & Related papers (2020-08-06T11:14:37Z)