A methodology for comparing and benchmarking quantum devices
- URL: http://arxiv.org/abs/2405.08617v1
- Date: Tue, 14 May 2024 13:58:53 GMT
- Title: A methodology for comparing and benchmarking quantum devices
- Authors: Jessica Park, Susan Stepney, Irene D'Amico
- Abstract summary: It is first necessary to define the criteria for success: what are the metrics or statistics that are relevant to the problem?
This paper lays out a framework by which any user, developer or researcher can define, articulate and justify the success criteria and associated benchmarks that have been used to solve their problem or make their claim.
- Score: 0.19116784879310028
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantum Computing (QC) is undergoing a high rate of development, investment and research devoted to its improvement. However, there is little consensus in the industry and wider literature as to what improvement might consist of beyond ambiguous statements of "more qubits" and "fewer errors". Before one can decide how to improve something, it is first necessary to define the criteria for success: what are the metrics or statistics that are relevant to the problem? The lack of clarity surrounding this question has led to a rapidly developing capability with little consistency or standards present across the board. This paper lays out a framework by which any user, developer or researcher can define, articulate and justify the success criteria and associated benchmarks that have been used to solve their problem or make their claim.
Related papers
- BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices [28.70453947993952]
We develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it.
We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues.
arXiv Detail & Related papers (2024-11-20T02:38:24Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness [106.52630978891054]
We present a taxonomy of uncertainty specific to vision-language AI systems.
We also introduce a new metric, confidence-weighted accuracy, that is well correlated with both accuracy and calibration error.
arXiv Detail & Related papers (2024-07-02T04:23:54Z) - AI Agents That Matter [11.794931453828974]
AI agents are an exciting new research direction, and agent development is driven by benchmarks.
There is a narrow focus on accuracy without attention to other metrics.
The benchmarking needs of model and downstream developers have been conflated.
Many agent benchmarks have inadequate holdout sets, and sometimes none at all.
arXiv Detail & Related papers (2024-07-01T17:48:14Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - From Chaos to Clarity: Claim Normalization to Empower Fact-Checking [57.024192702939736]
Claim Normalization (aka ClaimNorm) aims to decompose complex and noisy social media posts into more straightforward and understandable forms.
We propose CACN, a pioneering approach that leverages chain-of-thought and claim check-worthiness estimation.
Our experiments demonstrate that CACN outperforms several baselines across various evaluation measures.
arXiv Detail & Related papers (2023-10-22T16:07:06Z) - Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z) - Shortcomings of Question Answering Based Factuality Frameworks for Error Localization [51.01957350348377]
We show that question answering (QA)-based factuality metrics fail to correctly identify error spans in generated summaries.
Our analysis reveals a major reason for such poor localization: questions generated by the QG module often inherit errors from non-factual summaries which are then propagated further into downstream modules.
Our experiments conclusively show that there exist fundamental issues with localization using the QA framework which cannot be fixed solely by stronger QA and QG models.
arXiv Detail & Related papers (2022-10-13T05:23:38Z) - Towards QD-suite: developing a set of benchmarks for Quality-Diversity algorithms [0.0]
Existing benchmarks are not standardized, and there is currently no MNIST equivalent for Quality-Diversity (QD).
We argue that the identification of challenges faced by QD methods and the development of targeted, challenging, scalable benchmarks is an important step.
arXiv Detail & Related papers (2022-05-06T13:33:50Z) - What Will it Take to Fix Benchmarking in Natural Language Understanding? [30.888416756627155]
We lay out four criteria that we argue NLU benchmarks should meet.
Restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets.
arXiv Detail & Related papers (2021-04-05T20:36:11Z) - A Framework for Evaluation of Machine Reading Comprehension Gold Standards [7.6250852763032375]
This paper proposes a unifying framework to investigate the present linguistic features, required reasoning and background knowledge and factual correctness.
The findings include the absence of features that contribute towards lexical ambiguity, the varying factual correctness of the expected answers and the presence of lexical cues, all of which potentially lower the reading comprehension complexity and quality of the evaluation data.
arXiv Detail & Related papers (2020-03-10T11:30:22Z)