Towards a Fault-Injection Benchmarking Suite
- URL: http://arxiv.org/abs/2403.20319v1
- Date: Fri, 29 Mar 2024 17:42:31 GMT
- Title: Towards a Fault-Injection Benchmarking Suite
- Authors: Tianhao Wang, Robin Thunig, Horst Schirmeier
- Abstract summary: There is no agreed-upon benchmarking suite for demonstrating fault-tolerance approaches.
As a replacement, authors pick benchmarks from other domains.
We propose criteria for benchmark selection.
- Score: 2.2373909071130877
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Soft errors in memories and logic circuits are known to disturb program execution. In this context, the research community has been proposing a plethora of fault-tolerance (FT) solutions over the last decades, as well as fault-injection (FI) approaches to test, measure and compare them. However, there is no agreed-upon benchmarking suite for demonstrating FT or FI approaches. As a replacement, authors pick benchmarks from other domains, e.g. embedded systems. This leads to little comparability across publications, and causes behavioral overlap within benchmarks that were not selected for orthogonality in the FT/FI domain. In this paper, we want to initiate a discussion on what a benchmarking suite for the FT/FI domain should look like, and propose criteria for benchmark selection.
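As a concrete illustration of the fault model most FI campaigns assume, the sketch below flips single bits in a toy workload's input data and counts silent data corruptions against a golden run. This is a minimal, hypothetical Python example, not tooling from the paper:

```python
import random

def checksum(words):
    """Toy workload: 32-bit additive checksum over a list of words."""
    total = 0
    for w in words:
        total = (total + w) & 0xFFFFFFFF
    return total

def inject_bit_flip(words, word_idx, bit_idx):
    """Single-bit-flip fault model: flip one bit of one data word."""
    faulty = list(words)
    faulty[word_idx] ^= 1 << bit_idx
    return faulty

data = [random.getrandbits(32) for _ in range(64)]
golden = checksum(data)  # fault-free reference run

# Sweep the full fault space: every bit of every input word.
sdc = 0  # silent data corruptions (wrong result, no crash)
for wi in range(len(data)):
    for bi in range(32):
        if checksum(inject_bit_flip(data, wi, bi)) != golden:
            sdc += 1

print(f"{sdc}/{len(data) * 32} injections caused silent data corruption")
```

A real FI campaign would instead perturb registers or memory during execution and also classify crashes and timeouts; the point here is only the golden-run comparison that FT/FI benchmarks must support.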
Related papers
- Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation [15.565644819269803]
We show how some overlooked methodological choices can significantly influence Benchmark Agreement Testing (BAT) results.
We introduce BenchBench, a python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers.
arXiv Detail & Related papers (2024-07-18T17:00:23Z)
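The agreement measurement at the heart of BAT can be illustrated without BenchBench itself (whose API is not shown here): rank the same models on two benchmarks and correlate the rankings, e.g. with Kendall's tau. All model names and scores below are made up:

```python
from scipy.stats import kendalltau

# Hypothetical per-model scores on two benchmarks (illustrative values).
bench_a = {"model1": 71.2, "model2": 65.4, "model3": 80.1, "model4": 58.9}
bench_b = {"model1": 44.0, "model2": 47.5, "model3": 61.3, "model4": 39.8}

models = sorted(bench_a)  # fixed model order for a paired comparison
scores_a = [bench_a[m] for m in models]
scores_b = [bench_b[m] for m in models]

# Kendall's tau measures how consistently the two benchmarks rank models.
tau, p_value = kendalltau(scores_a, scores_b)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```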
- Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese [3.724862061593193]
The prevailing issue of factual inconsistency errors in conventional Retrieval Augmented Generation (RAG) motivates the study of Factual Consistency Evaluation (FCE).
We propose the first comprehensive FCE benchmark, Face4RAG, for RAG independent of the underlying Large Language Models (LLMs).
On the proposed benchmark, we discover the failure of existing FCE methods to detect the logical fallacy, which refers to a mismatch of logic structures between the answer and the retrieved reference.
arXiv Detail & Related papers (2024-07-01T08:35:04Z)
- MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark that demands a meta-reasoning skill.
MR-BEN is a comprehensive benchmark comprising 5,975 questions collected from human experts.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction [75.25114727856861]
Large language models (LLMs) tend to suffer from deterioration at the latter stage of the supervised fine-tuning (SFT) process.
We introduce a simple disperse-then-merge framework to address the issue.
Our framework outperforms various sophisticated methods such as data curation and training regularization on a series of standard knowledge and reasoning benchmarks.
arXiv Detail & Related papers (2024-05-22T08:18:19Z)
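One plausible reading of the merging step, sketched under the assumption that sub-models fine-tuned on disjoint data portions are combined by uniform parameter averaging (the module names are hypothetical, and this is not the authors' code):

```python
import torch

def merge_state_dicts(state_dicts):
    """Uniformly average the parameters of several fine-tuned sub-models."""
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack(
            [sd[name].float() for sd in state_dicts]
        ).mean(dim=0)
    return merged

# Toy stand-ins for sub-models fine-tuned on disjoint data portions.
sub_models = [
    {"linear.weight": torch.randn(4, 4), "linear.bias": torch.randn(4)}
    for _ in range(3)
]
merged = merge_state_dicts(sub_models)
print({name: tuple(param.shape) for name, param in merged.items()})
```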
- Benchmarking Video Frame Interpolation [11.918489436283748]
We present a benchmark that establishes consistent error metrics by utilizing a submission website that computes them.
We also present a test set adhering to the assumption of linearity by utilizing synthetic data, and evaluate the computational efficiency in a coherent manner.
arXiv Detail & Related papers (2024-03-25T19:13:12Z)
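For the frame-interpolation benchmark above, the submission website computes the error metrics server-side; as a stand-in, the snippet below implements one standard metric, PSNR, between an interpolated frame and the ground truth (a generic formula, not the benchmark's exact code):

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio between two frames, in dB."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy frames: ground-truth middle frame vs. an interpolated estimate.
rng = np.random.default_rng(0)
gt = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
pred = np.clip(gt.astype(int) + rng.integers(-5, 6, size=gt.shape), 0, 255)
print(f"PSNR: {psnr(pred, gt):.2f} dB")
```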
- Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization [64.62570402941387]
We use a single test sample to adapt multi-modal prompts at test time by minimizing the feature distribution shift to bridge the gap in the test domain.
Our method improves zero-shot top-1 accuracy beyond existing prompt-learning techniques, with a 3.08% improvement over the baseline MaPLe.
arXiv Detail & Related papers (2023-11-02T17:59:32Z)
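A minimal sketch of the test-time alignment idea above, assuming precomputed source feature statistics and a learnable prompt that shifts features of one test sample's augmented views (the additive prompt stands in for the actual prompt-conditioned encoder; this is not the paper's implementation):

```python
import torch

def alignment_loss(test_feats, src_mean, src_var):
    """Match first and second moments of test features to source statistics."""
    mean_term = (test_feats.mean(dim=0) - src_mean).pow(2).sum()
    var_term = (test_feats.var(dim=0, unbiased=False) - src_var).pow(2).sum()
    return mean_term + var_term

dim = 16
src_mean, src_var = torch.zeros(dim), torch.ones(dim)  # from source data
prompt = torch.randn(dim, requires_grad=True)          # learnable prompt
sample_feats = torch.randn(8, dim)   # augmented views of ONE test sample

opt = torch.optim.SGD([prompt], lr=0.1)
for step in range(10):
    feats = sample_feats + prompt    # stand-in for the prompted encoder
    loss = alignment_loss(feats, src_mean, src_var)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"alignment loss after adaptation: {loss.item():.4f}")
```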
- Better Practices for Domain Adaptation [62.70267990659201]
Domain adaptation (DA) aims to provide frameworks for adapting models to deployment data without using labels.
The lack of a clear validation protocol for DA has led to bad practices in the literature.
We show challenges across all three branches of domain adaptation methodology.
arXiv Detail & Related papers (2023-09-07T17:44:18Z)
- On Pitfalls of Test-Time Adaptation [82.8392232222119]
Test-Time Adaptation (TTA) has emerged as a promising approach for tackling the robustness challenge under distribution shifts.
We present TTAB, a test-time adaptation benchmark that encompasses ten state-of-the-art algorithms, a diverse array of distribution shifts, and two evaluation protocols.
arXiv Detail & Related papers (2023-06-06T09:35:29Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Few-shot Fine-tuning is All You Need for Source-free Domain Adaptation [2.837894907597713]
We investigate the practicality of source-free unsupervised domain adaptation (SFUDA) over unsupervised domain adaptation (UDA).
We show that SFUDA relies on unlabeled target data, which limits its practicality in real-world applications.
We show that fine-tuning a source pretrained model with a few labeled data is a practical and reliable solution to circumvent the limitations of SFUDA.
arXiv Detail & Related papers (2023-04-03T08:24:40Z)
- On the Assessment of Benchmark Suites for Algorithm Comparison [7.501426386641256]
We show that most benchmark functions of the BBOB suite have high difficulty levels (relative to the optimization algorithms) and low discrimination.
We discuss potential uses of IRT in benchmarking, including its use to improve the design of benchmark suites.
arXiv Detail & Related papers (2021-04-15T11:20:11Z)
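The IRT view above can be made concrete with the two-parameter logistic (2PL) model: each benchmark function gets a difficulty and a discrimination, each algorithm an ability, and the model predicts the probability of success. A minimal sketch, with parameters that are illustrative rather than fitted to real BBOB data:

```python
import math

def p_solve(ability, difficulty, discrimination):
    """2PL IRT: probability that an algorithm 'solves' a benchmark item."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# A hard, low-discrimination function barely separates weak from strong
# solvers, which is the paper's criticism of much of the BBOB suite.
for ability in (-1.0, 0.0, 1.0):
    hard_flat = p_solve(ability, difficulty=2.0, discrimination=0.3)
    easy_sharp = p_solve(ability, difficulty=-1.0, discrimination=2.0)
    print(f"ability={ability:+.1f}: "
          f"hard/flat={hard_flat:.2f}  easy/sharp={easy_sharp:.2f}")
```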
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.