Global Benchmark Database
- URL: http://arxiv.org/abs/2405.10045v2
- Date: Thu, 27 Jun 2024 08:12:59 GMT
- Title: Global Benchmark Database
- Authors: Markus Iser, Christoph Jabs
- Abstract summary: Global Benchmark Database (GBD) is a comprehensive suite of tools for provisioning and sustainably maintaining benchmark instances and their metadata.
This paper introduces the data model of GBD as well as its interfaces and provides examples of how to interact with them.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents Global Benchmark Database (GBD), a comprehensive suite of tools for provisioning and sustainably maintaining benchmark instances and their metadata. The availability of benchmark metadata is essential for many tasks in empirical research, e.g., for the data-driven compilation of benchmarks, the domain-specific analysis of runtime experiments, or the instance-specific selection of solvers. In this paper, we introduce the data model of GBD as well as its interfaces and provide examples of how to interact with them. We also demonstrate the integration of custom data sources and explain how to extend GBD with additional problem domains, instance formats and feature extractors.
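As a hedged illustration of the kind of metadata-driven benchmark compilation the abstract describes, the sketch below runs a feature query against an in-memory SQLite table. The table name, columns, and values are hypothetical and do not reflect GBD's actual schema or API; they only show how instance selection by metadata features might look.

```python
# Hypothetical sketch: a metadata query in the spirit of GBD's data model.
# Table name, columns, and rows are illustrative, not GBD's actual schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE features (hash TEXT, family TEXT, clauses INTEGER)")
con.executemany("INSERT INTO features VALUES (?, ?, ?)", [
    ("a1", "hardware-verification", 120_000),
    ("b2", "planning", 4_500),
    ("c3", "hardware-verification", 9_800),
])

# Data-driven benchmark compilation: select instance identifiers
# matching a feature query (domain family plus a size bound).
rows = con.execute(
    "SELECT hash FROM features WHERE family = ? AND clauses < ?",
    ("hardware-verification", 10_000),
).fetchall()
print(rows)  # [('c3',)]
```

The same pattern extends to solver selection: join instance features with recorded runtimes and rank solvers per feature slice.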
Related papers
- BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation [13.897645524385274]
BenchHub is a dynamic benchmark repository that empowers researchers and developers to evaluate large language models (LLMs) more effectively.
It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases.
arXiv Detail & Related papers (2025-05-31T09:24:32Z)
- Adapting General-Purpose Embedding Models to Private Datasets Using Keyword-based Retrieval [19.57735892785756]
BMEmbed is a novel method for adapting general-purpose text embedding models to private datasets.
We construct supervisory signals from the ranking of keyword-based retrieval results to facilitate model adaptation.
We evaluate BMEmbed across a range of domains, datasets, and models, showing consistent improvements in retrieval performance.
arXiv Detail & Related papers (2025-05-31T03:06:09Z)
- GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling [34.697209279932686]
The General Document Intelligence Benchmark (GDI-Bench) features 1.9k images across 9 key scenarios and 19 document-specific tasks.
By decoupling visual complexity and reasoning complexity, the GDI-Bench structures graded tasks that allow performance assessment by difficulty.
We propose a GDI Model that mitigates the issue of catastrophic forgetting during the supervised fine-tuning process.
arXiv Detail & Related papers (2025-04-30T15:46:46Z)
- DataS^3: Dataset Subset Selection for Specialization [60.589117206895125]
We introduce DataS^3, the first dataset and benchmark designed specifically for the DS^3 problem.
DataS^3 encompasses diverse real-world application domains, each with a set of distinct deployments to specialize in.
We demonstrate the existence of manually curated (deployment-specific) expert subsets that outperform training on all available data with accuracy gains up to 51.3 percent.
arXiv Detail & Related papers (2025-04-22T21:25:14Z)
- Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery [9.723791642707738]
Domain Generalisation (DG) seeks to bridge the gap by enabling models to generalise to Out-Of-Distribution (OOD) data without access to target distributions during training.
We examine the generalisability and robustness of state-of-the-art object detectors under real-world distribution shifts.
We introduce Real-World Distribution Shifts (RWDS), a suite of three novel DG benchmarking datasets.
arXiv Detail & Related papers (2025-03-24T23:04:06Z)
- BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
- InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation [79.09622602860703]
We introduce InsightBench, a benchmark dataset with three key features.
It consists of 100 datasets representing diverse business use cases such as finance and incident management.
Unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics.
arXiv Detail & Related papers (2024-07-08T22:06:09Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Generalizable Metric Network for Cross-domain Person Re-identification [55.71632958027289]
The cross-domain (i.e., domain generalization) scenario presents a challenge in Re-ID tasks.
Most existing methods aim to learn domain-invariant or robust features for all domains.
We propose a Generalizable Metric Network (GMN) to explore sample similarity in the sample-pair space.
arXiv Detail & Related papers (2023-06-21T03:05:25Z)
- OPTION: OPTImization Algorithm Benchmarking ONtology [4.060078409841919]
OPTION (OPTImization algorithm benchmarking ONtology) is a semantically rich, machine-readable data model for benchmarking platforms.
Our ontology provides the vocabulary needed for semantic annotation of the core entities involved in the benchmarking process.
It also provides means for automatic data integration, improved interoperability, and powerful querying capabilities.
arXiv Detail & Related papers (2022-11-21T10:34:43Z)
- Improving Multi-Domain Generalization through Domain Re-labeling [31.636953426159224]
We study the important link between pre-specified domain labels and generalization performance.
We introduce a general approach for multi-domain generalization, MulDEns, that uses an ERM-based deep ensembling backbone.
We show that MulDEns does not require tailoring the augmentation strategy or the training process specific to a dataset.
arXiv Detail & Related papers (2021-12-17T23:21:50Z)
- OPTION: OPTImization Algorithm Benchmarking ONtology [4.060078409841919]
OPTION (OPTImization algorithm benchmarking ONtology) is a semantically rich, machine-readable data model for benchmarking algorithms.
Our ontology provides the vocabulary needed for semantic annotation of the core entities involved in the benchmarking process.
It also provides means for automated data integration, improved interoperability, powerful querying capabilities and reasoning.
arXiv Detail & Related papers (2021-04-24T06:11:30Z)
- Mapping Patterns for Virtual Knowledge Graphs [71.61234136161742]
Virtual Knowledge Graphs (VKG) constitute one of the most promising paradigms for integrating and accessing legacy data sources.
We build on well-established methodologies and patterns studied in data management, data analysis, and conceptual modeling.
We validate our catalog on the considered VKG scenarios, showing it covers the vast majority of patterns present therein.
arXiv Detail & Related papers (2020-12-03T13:54:52Z)
- Open Graph Benchmark: Datasets for Machine Learning on Graphs [86.96887552203479]
We present the Open Graph Benchmark (OGB) to facilitate scalable, robust, and reproducible graph machine learning (ML) research.
OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains.
For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics.
arXiv Detail & Related papers (2020-05-02T03:09:50Z)
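The unified, per-dataset evaluation protocol that OGB advocates can be sketched as a small wrapper in which each dataset fixes its split and metric so that callers only supply predictions. This is a hypothetical illustration of the pattern, not OGB's actual API; the dataset name, registry, and indices are invented for the example.

```python
# Hypothetical sketch of a unified evaluation protocol in the spirit of OGB:
# each dataset registers a fixed test split and metric, so every model is
# scored identically. (Illustrative only, not OGB's actual Evaluator API.)
PROTOCOLS = {
    # dataset name -> (fixed test indices, metric name)
    "toy-graph": ([2, 3], "accuracy"),
}

def evaluate(dataset, y_true, y_pred):
    """Score predictions under the dataset's registered split and metric."""
    idx, metric = PROTOCOLS[dataset]
    if metric == "accuracy":
        correct = sum(y_true[i] == y_pred[i] for i in idx)
        return correct / len(idx)
    raise ValueError(f"unknown metric: {metric}")

labels = [0, 1, 1, 0]
preds = [0, 1, 1, 1]
print(evaluate("toy-graph", labels, preds))  # 0.5
```

Fixing the split and metric inside the registry, rather than leaving them to each experimenter, is what makes results comparable across papers.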
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.