Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking for
Everyone
- URL: http://arxiv.org/abs/2110.05802v1
- Date: Tue, 12 Oct 2021 07:54:34 GMT
- Title: Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking for
Everyone
- Authors: Zhen Xu, Huan Zhao, Wei-Wei Tu, Magali Richard, Sergio Escalera,
Isabelle Guyon
- Abstract summary: We introduce Codabench, an open-source, community-driven platform for benchmarking algorithms or software agents against datasets or tasks.
A public instance of Codabench is open to everyone, free of charge, and allows benchmark organizers to compare submissions fairly.
- Score: 45.673814384050004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Obtaining standardized, crowdsourced benchmarks of computational methods is a
major issue in scientific communities. Dedicated frameworks enabling fair,
continuous benchmarking in a unified environment are yet to be developed. Here
we introduce Codabench, an open-source, community-driven platform for
benchmarking algorithms or software agents against datasets or tasks. A public
instance of Codabench is open to everyone, free of charge, and allows benchmark
organizers to compare submissions fairly, under the same setting (software,
hardware, data, algorithms), with custom protocols and data formats. Codabench
has unique features facilitating the organization of benchmarks flexibly,
easily and reproducibly. Firstly, it supports code submission and data
submission for testing on dedicated compute workers, which can be supplied by
the benchmark organizers. This makes the system scalable, at low cost for the
platform providers. Secondly, Codabench benchmarks are created from
self-contained bundles, which are zip files containing a full description of
the benchmark in a configuration file (following a well-defined schema),
documentation pages, data, and ingestion and scoring programs, making benchmarks
reusable and portable. The Codabench documentation includes many examples of
bundles that can serve as templates. Thirdly, Codabench uses Docker containers for each
task's running environment to make results reproducible. Codabench has been
used internally and externally for more than 10 applications during the past 6
months. As illustrative use cases, we introduce 4 diverse benchmarks covering
Graph Machine Learning, Cancer Heterogeneity, Clinical Diagnosis and
Reinforcement Learning.
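To make the bundle concept concrete, the sketch below shows a minimal scoring program in the style commonly used by CodaLab-family platforms: an ingestion step writes a submission's predictions into one directory, and the scoring step compares them against reference labels and writes a scores file read by the leaderboard. The directory layout, file names (solution.txt, prediction.txt, scores.txt) and the accuracy metric are illustrative assumptions, not the exact Codabench API; the bundle's configuration file defines how such programs are actually invoked.

```python
# Hypothetical scoring program for a Codabench-style bundle (illustrative sketch only).
# Assumed convention: argv[1] is an input dir holding reference data and the ingestion
# program's output; argv[2] is the output dir where the scores file is written.
import os
import sys


def read_labels(path):
    """Read one numeric label per non-empty line."""
    with open(path) as f:
        return [float(line.strip()) for line in f if line.strip()]


def main():
    input_dir, output_dir = sys.argv[1], sys.argv[2]

    # Assumed layout: reference labels under 'ref/', submission output under 'res/'.
    y_true = read_labels(os.path.join(input_dir, "ref", "solution.txt"))
    y_pred = read_labels(os.path.join(input_dir, "res", "prediction.txt"))

    # Simple accuracy as the leaderboard metric (placeholder for a task-specific metric).
    correct = sum(1 for t, p in zip(y_true, y_pred) if round(p) == t)
    accuracy = correct / len(y_true) if y_true else 0.0

    # Scores are exposed to the platform as 'key: value' pairs in scores.txt.
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "scores.txt"), "w") as f:
        f.write(f"accuracy: {accuracy:.4f}\n")


if __name__ == "__main__":
    main()
```

Packaged together with the configuration file, documentation pages and data into a zip bundle, a program like this runs inside the task's Docker image on a compute worker, which is what makes the benchmark portable and the results reproducible.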
Related papers
- Do Text-to-Vis Benchmarks Test Real Use of Visualisations? [11.442971909006657]
This paper investigates whether benchmarks reflect real-world use through an empirical study comparing benchmark datasets with code from public repositories.
Our findings reveal a substantial gap, with evaluations not testing the same distribution of chart types, attributes, and actions as real-world examples.
One dataset is representative, but requires extensive modification to become a practical end-to-end benchmark.
This shows that new benchmarks are needed to support the development of systems that truly address users' visualisation needs.
arXiv Detail & Related papers (2024-07-29T06:13:28Z) - Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench [15.565644819269803]
We show how some overlooked methodological choices can significantly influence Benchmark Agreement Testing (BAT) results.
We introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers.
arXiv Detail & Related papers (2024-07-18T17:00:23Z) - CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present textbfname (textbfInformation textbfRetrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
name comprises textbften meticulously curated code datasets, spanning textbfeight distinctive retrieval tasks across textbfseven diverse domains.
We evaluate nine widely used retrieval models using name, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z) - PruningBench: A Comprehensive Benchmark of Structural Pruning [50.23493036025595]
We present the first comprehensive benchmark, termed textitPruningBench, for structural pruning.
PruningBench employs a unified and consistent framework for evaluating the effectiveness of diverse structural pruning techniques.
It provides easily implementable interfaces to facilitate the implementation of future pruning methods, and enables subsequent researchers to incorporate their work into our leaderboards.
arXiv Detail & Related papers (2024-06-18T06:37:26Z) - Long Code Arena: a Set of Benchmarks for Long-Context Code Models [75.70507534322336]
Long Code Arena is a suite of six benchmarks for code processing tasks that require project-wide context.
These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization.
For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions.
arXiv Detail & Related papers (2024-06-17T14:58:29Z) - ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with
Distributed Stream Processing Frameworks [1.4374467687356276]
This paper introduces ShuffleBench, a novel benchmark to evaluate the performance of modern stream processing frameworks.
ShuffleBench is inspired by requirements for near real-time analytics of a large cloud observability platform.
Our results show that Flink achieves the highest throughput while Hazelcast processes data streams with the lowest latency.
arXiv Detail & Related papers (2024-03-07T15:06:24Z) - BeGin: Extensive Benchmark Scenarios and An Easy-to-use Framework for Graph Continual Learning [18.32208249344985]
Continual Learning (CL) is the process of continually learning a sequence of tasks.
CL methods for graph data (graph CL) are relatively underexplored because of the lack of standard experimental settings.
We develop BeGin, an easy and fool-proof framework for graph CL.
arXiv Detail & Related papers (2022-11-26T13:48:05Z) - WRENCH: A Comprehensive Benchmark for Weak Supervision [66.82046201714766]
The benchmark consists of 22 varied real-world datasets for classification and sequence tagging.
We use the benchmark to conduct extensive comparisons over more than 100 method variants to demonstrate its efficacy as a benchmark platform.
arXiv Detail & Related papers (2021-09-23T13:47:16Z) - Are Missing Links Predictable? An Inferential Benchmark for Knowledge
Graph Completion [79.07695173192472]
InferWiki improves upon existing benchmarks in inferential ability, assumptions, and patterns.
Each testing sample is predictable with supportive data in the training set.
In experiments, we curate two settings of InferWiki varying in sizes and structures, and apply the construction process on CoDEx as comparative datasets.
arXiv Detail & Related papers (2021-08-03T09:51:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.