Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation
- URL: http://arxiv.org/abs/2312.00645v2
- Date: Mon, 25 Dec 2023 07:45:14 GMT
- Title: Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation
- Authors: Paul Bricman
- Abstract summary: We propose hashmarking, a protocol for evaluating language models in the open without having to disclose the correct answers.
In its simplest form, a hashmark is a benchmark whose reference solutions have been cryptographically hashed prior to publication.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is a growing need to gain insight into language model capabilities that
relate to sensitive topics, such as bioterrorism or cyberwarfare. However,
traditional open source benchmarks are not fit for the task, due to the
associated practice of publishing the correct answers in human-readable form.
At the same time, enforcing mandatory closed-quarters evaluations might stifle
development and erode trust. In this context, we propose hashmarking, a
protocol for evaluating language models in the open without having to disclose
the correct answers. In its simplest form, a hashmark is a benchmark whose
reference solutions have been cryptographically hashed prior to publication.
Following an overview of the proposed evaluation protocol, we go on to assess
its resilience against traditional attack vectors (e.g. rainbow table attacks),
as well as against failure modes unique to increasingly capable generative
models.
Related papers
- CORE: Context-Robust Remasking for Diffusion Language Models [51.59514489363897]
We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision.<n>Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations.<n>On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
arXiv Detail & Related papers (2026-02-04T00:12:30Z) - The Hidden Cost of Modeling P(X): Vulnerability to Membership Inference Attacks in Generative Text Classifiers [6.542294761666199]
Membership Inference Attacks (MIAs) pose a critical privacy threat by enabling adversaries to determine whether a specific sample was included in a model's training dataset.<n>We show that fully generative classifiers which explicitly model the joint likelihood $P(X,Y)$ are most vulnerable to membership leakage.
arXiv Detail & Related papers (2025-10-17T18:09:33Z) - Evaluating Language Model Reasoning about Confidential Information [95.64687778185703]
We study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications.<n>We develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized.<n>We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance.
arXiv Detail & Related papers (2025-08-27T15:39:46Z) - The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage [71.8564105095189]
We introduce N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model.<n>We first demonstrate on a diverse set of existing benchmarks that N-Gram Coverage Attack outperforms other black-box methods.<n>We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference.
arXiv Detail & Related papers (2025-08-13T08:35:16Z) - What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks [8.012203293561196]
We show that one of the most widely used benchmarks for evaluating common-sense reasoning, HellaSwag, has severe construct validity issues.
We argue that this benchmark does not accurately measure common-sense reasoning and, therefore, should not be used for evaluation in its current state.
arXiv Detail & Related papers (2025-04-10T15:01:46Z) - Unsupervised dense retrieval with conterfactual contrastive learning [16.679649921935482]
We propose to improve the robustness of dense retrieval models by enhancing their sensitivity of fine-graned relevance signals.
A model achieving sensitivity in this context should exhibit high variances when documents' key passages determining their relevance to queries have been modified.
Motivated by causality and counterfactual analysis, we propose a series of counterfactual regularization methods.
arXiv Detail & Related papers (2024-12-30T07:01:34Z) - Provably Secure Disambiguating Neural Linguistic Steganography [66.30965740387047]
The segmentation ambiguity problem, which arises when using language models based on subwords, leads to occasional decoding failures.
We propose a novel secure disambiguation method named SyncPool, which effectively addresses the segmentation ambiguity problem.
SyncPool does not change the size of the candidate pool or the distribution of tokens and thus is applicable to provably secure language steganography methods.
arXiv Detail & Related papers (2024-03-26T09:25:57Z) - Towards Imperceptible Document Manipulations against Neural Ranking
Models [13.777462017782659]
We propose a framework called Imperceptible DocumEnt Manipulation (IDEM) to produce adversarial documents.
IDEM instructs a well-established generative language model, such as BART, to generate connection sentences without introducing easy-to-detect errors.
We show that IDEM can outperform strong baselines while preserving fluency and correctness of the target documents.
arXiv Detail & Related papers (2023-05-03T02:09:29Z) - Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - Detecting Backdoors in Deep Text Classifiers [43.36440869257781]
We present the first robust defence mechanism that generalizes to several backdoor attacks against text classification models.
Our technique is highly accurate at defending against state-of-the-art backdoor attacks, including data poisoning and weight poisoning.
arXiv Detail & Related papers (2022-10-11T07:48:03Z) - A Unified Evaluation of Textual Backdoor Learning: Frameworks and
Benchmarks [72.7373468905418]
We develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and make low-confidence predictions without enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z) - RobustBench: a standardized adversarial robustness benchmark [84.50044645539305]
Key challenge in benchmarking robustness is that its evaluation is often error-prone leading to robustness overestimation.
We evaluate adversarial robustness with AutoAttack, an ensemble of white- and black-box attacks.
We analyze the impact of robustness on the performance on distribution shifts, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness, and transferability.
arXiv Detail & Related papers (2020-10-19T17:06:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.