A methodology for comparing and benchmarking quantum devices
- URL: http://arxiv.org/abs/2405.08617v1
- Date: Tue, 14 May 2024 13:58:53 GMT
- Title: A methodology for comparing and benchmarking quantum devices
- Authors: Jessica Park, Susan Stepney, Irene D'Amico
- Abstract summary: It is first necessary to define the criteria for success: what are the metrics or statistics that are relevant to the problem?
This paper lays out a framework by which any user, developer or researcher can define, articulate and justify the success criteria and associated benchmarks that have been used to solve their problem or make their claim.
- Score: 0.19116784879310028
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantum Computing (QC) is undergoing a high rate of development, investment and research devoted to its improvement. However, there is little consensus in the industry and wider literature as to what improvement might consist of beyond ambiguous statements of "more qubits" and "fewer errors". Before one can decide how to improve something, it is first necessary to define the criteria for success: what are the metrics or statistics that are relevant to the problem? The lack of clarity surrounding this question has led to a rapidly developing capability with little consistency or standards present across the board. This paper lays out a framework by which any user, developer or researcher can define, articulate and justify the success criteria and associated benchmarks that have been used to solve their problem or make their claim.
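The framework itself is described in prose; purely as a hedged illustration, the sketch below shows one hypothetical way a user might record and check the success criteria the paper asks them to define, articulate and justify. All names, fields and numbers are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical sketch only: a minimal record of a benchmark claim, assuming a
# user wants to state the problem, the metric, the success threshold and the
# justification together. Field names are illustrative, not the paper's terms.
@dataclass
class BenchmarkClaim:
    problem: str                          # the task the device is benchmarked on
    metric_name: str                      # e.g. "average gate fidelity" (example only)
    metric_fn: Callable[[Dict], float]    # how the metric is computed from raw results
    success_threshold: float              # value the metric must reach for the claim to hold
    justification: str                    # why this metric and threshold suit the problem

    def evaluate(self, raw_results: Dict) -> bool:
        """Return True if the measured metric meets the stated success criterion."""
        return self.metric_fn(raw_results) >= self.success_threshold


# Usage with made-up numbers: a claim about average gate fidelity on a fictional device.
claim = BenchmarkClaim(
    problem="benchmarking two-qubit gates on device A",
    metric_name="average gate fidelity",
    metric_fn=lambda r: sum(r["fidelities"]) / len(r["fidelities"]),
    success_threshold=0.99,
    justification="below 0.99 the intended error-correction scheme is assumed impractical",
)
print(claim.evaluate({"fidelities": [0.992, 0.988, 0.991]}))  # True for this toy data
```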
Related papers
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness [106.52630978891054]
We present a taxonomy of uncertainty specific to vision-language AI systems.
We also introduce a new metric confidence-weighted accuracy, that is well correlated with both accuracy and calibration error.
arXiv Detail & Related papers (2024-07-02T04:23:54Z)
- AI Agents That Matter [11.794931453828974]
AI agents are an exciting new research direction, and agent development is driven by benchmarks.
There is a narrow focus on accuracy without attention to other metrics.
The benchmarking needs of model developers and downstream developers have been conflated.
Many agent benchmarks have inadequate holdout sets, and sometimes none at all.
arXiv Detail & Related papers (2024-07-01T17:48:14Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- From Chaos to Clarity: Claim Normalization to Empower Fact-Checking [57.024192702939736]
Claim Normalization (aka ClaimNorm) aims to decompose complex and noisy social media posts into more straightforward and understandable forms.
We propose CACN, a pioneering approach that leverages chain-of-thought and claim check-worthiness estimation.
Our experiments demonstrate that CACN outperforms several baselines across various evaluation measures.
arXiv Detail & Related papers (2023-10-22T16:07:06Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
- Shortcomings of Question Answering Based Factuality Frameworks for Error Localization [51.01957350348377]
We show that question answering (QA)-based factuality metrics fail to correctly identify error spans in generated summaries.
Our analysis reveals a major reason for such poor localization: questions generated by the QG module often inherit errors from non-factual summaries which are then propagated further into downstream modules.
Our experiments conclusively show that there exist fundamental issues with localization using the QA framework which cannot be fixed solely by stronger QA and QG models.
arXiv Detail & Related papers (2022-10-13T05:23:38Z)
- Towards QD-suite: developing a set of benchmarks for Quality-Diversity algorithms [0.0]
Existing benchmarks are not standardized, and there is currently no MNIST equivalent for Quality-Diversity (QD).
We argue that the identification of challenges faced by QD methods and the development of targeted, challenging, scalable benchmarks is an important step.
arXiv Detail & Related papers (2022-05-06T13:33:50Z)
- What Will it Take to Fix Benchmarking in Natural Language Understanding? [30.888416756627155]
We lay out four criteria that we argue NLU benchmarks should meet.
Restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets.
arXiv Detail & Related papers (2021-04-05T20:36:11Z)
- Identifying Properties of Real-World Optimisation Problems through a Questionnaire [2.805617945875364]
This work investigates the properties of real-world problems through a questionnaire.
It enables the design of future benchmark problems that more closely resemble those found in the real world.
arXiv Detail & Related papers (2020-11-11T05:09:01Z)
- A Framework for Evaluation of Machine Reading Comprehension Gold Standards [7.6250852763032375]
This paper proposes a unifying framework to investigate the present linguistic features, required reasoning and background knowledge and factual correctness.
The absence of features that contribute towards lexical ambiguity, the varying factual correctness of the expected answers, and the presence of lexical cues all potentially lower the reading comprehension complexity and the quality of the evaluation data.
arXiv Detail & Related papers (2020-03-10T11:30:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.