Detecting Edit Failures In Large Language Models: An Improved
Specificity Benchmark
- URL: http://arxiv.org/abs/2305.17553v2
- Date: Sat, 3 Jun 2023 08:01:11 GMT
- Title: Detecting Edit Failures In Large Language Models: An Improved
Specificity Benchmark
- Authors: Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas
and Fazl Barez
- Abstract summary: We extend the existing CounterFact benchmark to include a dynamic component and dub our benchmark CounterFact+.
We use this improved benchmark to evaluate recent model editing techniques and find that they suffer from low specificity.
- Score: 9.45927470587879
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent model editing techniques promise to mitigate the problem of memorizing
false or outdated associations during LLM training. However, we show that these
techniques can introduce large unwanted side effects which are not detected by
existing specificity benchmarks. We extend the existing CounterFact benchmark
to include a dynamic component and dub our benchmark CounterFact+.
Additionally, we extend the metrics used for measuring specificity with a
principled KL divergence-based metric. We use this improved benchmark to
evaluate recent model editing techniques and find that they suffer from low
specificity. Our findings highlight the need for improved specificity
benchmarks that identify and prevent unwanted side effects.
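To make the KL divergence-based specificity idea concrete, here is a minimal sketch, not the paper's exact metric or code: it averages the KL divergence between the edited and unedited model's next-token distributions over prompts that the edit should leave untouched. The function name, model interface, and prompt selection are illustrative assumptions.
```python
# Illustrative sketch (not the paper's implementation): a KL-based specificity
# score comparing an edited model against its unedited counterpart on prompts
# unrelated to the edit. High values indicate unwanted side effects.
import torch
import torch.nn.functional as F

@torch.no_grad()
def specificity_kl(base_model, edited_model, tokenizer, prompts):
    """Mean KL(edited || base) over next-token distributions on prompts
    that the edit should NOT have affected (hypothetical helper)."""
    kls = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        log_p_base = F.log_softmax(base_model(**inputs).logits[0, -1], dim=-1)
        log_p_edit = F.log_softmax(edited_model(**inputs).logits[0, -1], dim=-1)
        # KL(edited || base): how far the edit pushed the distribution away
        kl = (log_p_edit.exp() * (log_p_edit - log_p_base)).sum()
        kls.append(kl.item())
    return sum(kls) / len(kls)

# Usage (illustrative): pass any Hugging Face-style causal LMs and a tokenizer,
# plus "neighborhood" prompts that are unrelated to the edited fact.
```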
Related papers
- Position: LLM Unlearning Benchmarks are Weak Measures of Progress [31.957968729934745]
We find that existing benchmarks provide an overly optimistic and potentially misleading view of the effectiveness of candidate unlearning methods.
We identify that existing benchmarks are particularly vulnerable to modifications that introduce even loose dependencies between the forget and retain information.
arXiv Detail & Related papers (2024-10-03T18:07:25Z)
- DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation [18.77296551727931]
We propose DECIDER, a novel approach that leverages priors from large language models (LLMs) and vision-language models (VLMs) to detect failures in image models.
DECIDER consistently achieves state-of-the-art failure detection performance, significantly outperforming baselines in terms of the overall Matthews correlation coefficient.
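The Matthews correlation coefficient used as the headline metric here can be computed directly; the snippet below is a generic, toy illustration of scoring a binary failure detector with MCC (made-up labels, not DECIDER's pipeline).
```python
# Toy illustration: scoring a binary failure detector with the Matthews
# correlation coefficient. The labels and predictions are placeholders.
from sklearn.metrics import matthews_corrcoef

y_true = [1, 0, 0, 1, 1, 0, 1, 0]  # 1 = the image model actually failed
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = detector flagged the sample as a failure

print(f"MCC: {matthews_corrcoef(y_true, y_pred):.3f}")  # ranges from -1 to +1
```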
arXiv Detail & Related papers (2024-08-01T07:08:11Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 types of translation accuracy errors.
We conduct a large-scale study, benchmarking 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks on ACES.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- AUPIMO: Redefining Visual Anomaly Detection Benchmarks with High Speed and Low Tolerance [0.562479170374811]
Per-IMage Overlap (PIMO) is a novel metric that addresses the shortcomings of AUROC and AUPRO.
Measuring recall per image simplifies computation and is more robust to noisy annotations.
Our experiments demonstrate that PIMO offers practical advantages and nuanced performance insights.
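The "recall per image" idea can be sketched in a few lines; this is a simplified illustration under assumed inputs (per-pixel anomaly score maps and ground-truth masks), not the AUPIMO implementation.
```python
# Simplified illustration of "recall per image" (not the official AUPIMO code):
# at a fixed anomaly-score threshold, compute pixel-level recall on each
# anomalous image separately, then inspect the per-image distribution.
import numpy as np

def per_image_recall(score_maps, gt_masks, threshold):
    """score_maps, gt_masks: lists of HxW arrays (anomaly scores, binary masks)."""
    recalls = []
    for scores, mask in zip(score_maps, gt_masks):
        pred = scores >= threshold
        true_pos = np.logical_and(pred, mask).sum()
        positives = mask.sum()
        if positives > 0:                      # only images that contain anomalies
            recalls.append(true_pos / positives)
    return np.array(recalls)                   # one recall value per image
```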
arXiv Detail & Related papers (2024-01-03T21:24:44Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
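As a toy illustration of multi-reference scoring with matching-based metrics (the strings and setup below are made up, not the paper's experiments), sacrebleu accepts several reference sets aligned with the hypotheses:
```python
# Toy illustration of multi-reference scoring with matching-based metrics.
# sacrebleu takes one list per reference "set", each aligned with the hypotheses.
import sacrebleu

hypotheses = ["the cat sat on the mat", "he plays football every weekend"]
references = [
    ["the cat sat on the mat", "he plays soccer every weekend"],          # reference set 1
    ["a cat was sitting on the mat", "he plays football each weekend"],   # reference set 2
]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```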
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Studying How to Efficiently and Effectively Guide Models with Explanations [52.498055901649025]
'Model guidance' is the idea of regularizing a model's explanations to ensure that they are "right for the right reasons".
We conduct an in-depth evaluation across various loss functions, attribution methods, models, and 'guidance depths' on the PASCAL VOC 2007 and MS COCO 2014 datasets.
Specifically, we guide the models via bounding box annotations, which are much cheaper to obtain than the commonly used segmentation masks.
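The guidance objective can be sketched generically as an attribution-outside-the-box penalty added to the classification loss; the snippet below is a simplified input-gradient variant with assumed shapes, not the paper's exact losses or attribution methods.
```python
# Simplified sketch of a model-guidance loss (not the paper's exact setup):
# penalize input-gradient attribution that falls outside the annotated
# bounding-box mask, alongside the usual classification loss.
import torch
import torch.nn.functional as F

def guidance_loss(model, images, labels, box_masks, lam=1.0):
    """box_masks: float tensor of shape (B, 1, H, W), 1 inside the bounding box."""
    images = images.clone().requires_grad_(True)
    logits = model(images)
    cls_loss = F.cross_entropy(logits, labels)

    # Input-gradient attribution for the target-class scores.
    target_scores = logits.gather(1, labels.unsqueeze(1)).sum()
    grads = torch.autograd.grad(target_scores, images, create_graph=True)[0]
    attribution = grads.abs().sum(dim=1, keepdim=True)        # (B, 1, H, W)

    # Fraction of attribution mass outside the box (to be minimized).
    outside = (attribution * (1.0 - box_masks)).sum(dim=(1, 2, 3))
    total = attribution.sum(dim=(1, 2, 3)) + 1e-8
    guide = (outside / total).mean()

    return cls_loss + lam * guide
```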
arXiv Detail & Related papers (2023-03-21T15:34:50Z)
- Learning Dynamic Compact Memory Embedding for Deformable Visual Object Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms strong segmentation-based trackers such as D3S and SiamMask on the DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z)
- A critical analysis of metrics used for measuring progress in artificial intelligence [9.387811897655016]
We analyse the current landscape of performance metrics based on data covering 3867 machine learning model performance results.
Results suggest that the large majority of metrics currently used have properties that may result in an inadequate reflection of a model's performance.
We describe ambiguities in reported metrics, which may lead to difficulties in interpreting and comparing model performances.
arXiv Detail & Related papers (2020-08-06T11:14:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.