Towards Contamination Resistant Benchmarks
- URL: http://arxiv.org/abs/2505.08389v1
- Date: Tue, 13 May 2025 09:35:40 GMT
- Title: Towards Contamination Resistant Benchmarks
- Authors: Rahmatullah Musawi, Sheng Lu
- Abstract summary: Evaluating large language models (LLMs) properly is crucial for understanding their potential and addressing concerns such as safety. Contamination stands out as a key issue that undermines the reliability of evaluations. We propose a benchmark based on Caesar ciphers (e.g., "ab" to "bc" when the shift is 1), which, despite its simplicity, is an excellent example of a contamination resistant benchmark.
- Score: 0.6906005491572401
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid development of large language models (LLMs) has transformed the landscape of natural language processing. Evaluating LLMs properly is crucial for understanding their potential and addressing concerns such as safety. However, LLM evaluation is confronted by various factors, among which contamination stands out as a key issue that undermines the reliability of evaluations. In this work, we introduce the concept of contamination resistance to address this challenge. We propose a benchmark based on Caesar ciphers (e.g., "ab" to "bc" when the shift is 1), which, despite its simplicity, is an excellent example of a contamination resistant benchmark. We test this benchmark on widely used LLMs under various settings, and we find that these models struggle with this benchmark when contamination is controlled. Our findings reveal issues in current LLMs and raise important questions regarding their true capabilities. Our work contributes to the development of contamination resistant benchmarks, enabling more rigorous LLM evaluation and offering insights into the true capabilities and limitations of LLMs.
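To make the core idea concrete, the sketch below shows how Caesar-cipher items could be generated and scored on the fly. This is a minimal illustration under our own assumptions: the function names, prompt wording, and exact-match scoring are illustrative choices, not the paper's released code.

```python
import random
import string

def caesar_shift(text: str, shift: int) -> str:
    """Shift each letter by `shift` positions, wrapping around the alphabet."""
    def shift_char(c: str) -> str:
        if c.islower():
            return chr((ord(c) - ord("a") + shift) % 26 + ord("a"))
        if c.isupper():
            return chr((ord(c) - ord("A") + shift) % 26 + ord("A"))
        return c  # digits, spaces, and punctuation pass through unchanged
    return "".join(shift_char(c) for c in text)

def make_item(length: int = 8, shift: int = 1) -> dict:
    """Sample a fresh plaintext and pair it with its ciphertext, e.g. "ab" -> "bc" at shift 1."""
    plaintext = "".join(random.choices(string.ascii_lowercase, k=length))
    prompt = f"Apply a Caesar cipher with shift {shift} to: {plaintext}"
    return {"prompt": prompt, "answer": caesar_shift(plaintext, shift)}

# Exact-match scoring against the reference answer.
item = make_item()
model_output = item["answer"]  # placeholder for a real LLM response
print(item["prompt"])
print("correct:", model_output.strip().lower() == item["answer"])
```

Because plaintexts are sampled fresh at evaluation time, the expected outputs cannot appear verbatim in any pre-existing training corpus, which is the property the abstract refers to as contamination resistance.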
Related papers
- Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
- Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge [0.3759936323189418]
Large Language Models (LLMs) have become increasingly powerful and ubiquitous, but their nature poses challenges to the reliability of their outputs. We introduce a novel framework for rigorously evaluating the reliability of LLM judgments, leveraging McDonald's omega.
arXiv Detail & Related papers (2024-12-17T03:37:31Z)
- What You See Is Not Always What You Get: An Empirical Study of Code Comprehension by Large Language Models [0.5735035463793009]
We investigate the vulnerability of large language models (LLMs) to imperceptible attacks, where hidden character manipulation in source code misleads LLMs' behaviour while remaining undetectable to human reviewers. These attacks include code reordering, invisible coding characters, code deletions, and code homoglyphs. Our findings confirm the susceptibility of LLMs to imperceptible coding character attacks, with different LLMs showing varying degrees of negative correlation between perturbation magnitude and performance.
arXiv Detail & Related papers (2024-12-11T04:52:41Z)
- SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types [21.683010095703832]
We develop a novel benchmark to assess the generalization of large language model (LLM) safety across various tasks and prompt types.
This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak on LLM safety.
Our assessment reveals that most LLMs perform worse on discriminative tasks than generative ones, and are highly susceptible to prompts, indicating poor generalization in safety alignment.
arXiv Detail & Related papers (2024-10-29T11:47:01Z)
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
- ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs [95.15814662348245]
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order.
Recent Vision-Language Models (VLMs) have demonstrated remarkable proficiency in such reasoning tasks.
arXiv Detail & Related papers (2024-06-12T12:54:27Z)
- PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations [22.011216436252845]
We present PertEval, a toolkit for probing large language models' knowledge capacity.
PertEval employs human-like restatement techniques to generate on-the-fly test samples from static benchmarks.
Our findings provide insights for advancing more robust and genuinely knowledgeable LLMs.
arXiv Detail & Related papers (2024-05-30T06:38:32Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. How reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment [8.948475969696075]
Large Language Models (LLMs) are powerful zero-shot assessors used in real-world situations such as assessing written exams and benchmarking systems.
We show that short universal adversarial phrases can deceive judge LLMs into predicting inflated scores.
It is found that judge-LLMs are significantly more susceptible to these adversarial attacks when used for absolute scoring.
arXiv Detail & Related papers (2024-02-21T18:55:20Z)
- LLMs May Perform MCQA by Selecting the Least Incorrect Option [29.202758753639078]
Large Language Models (LLMs) have markedly enhanced performance across a variety of tasks. The adoption of Multiple Choice Question Answering (MCQA) as a benchmark for assessing LLMs has gained considerable traction. However, concerns regarding the robustness of this evaluative method persist.
arXiv Detail & Related papers (2024-02-02T12:07:00Z)
- Benchmarking LLMs via Uncertainty Quantification [91.72588235407379]
The proliferation of open-source Large Language Models (LLMs) has highlighted the urgent need for comprehensive evaluation methods.
We introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs.
arXiv Detail & Related papers (2024-01-23T14:29:17Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal a language model's comprehensive grasp of language and its proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.