OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
- URL: http://arxiv.org/abs/2512.14051v1
- Date: Tue, 16 Dec 2025 03:33:24 GMT
- Title: OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
- Authors: Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, Lijun Wu,
- Abstract summary: OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring.
- Score: 74.80873109856563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them remains a black box--characterized by opaque composition, uncertain provenance, and a lack of systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behaviors. To bridge this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring to foster data research. Extensive experiments on ODA--covering over 120 training datasets across multiple domains on 22 benchmarks, validated by more than 600 training runs and 40 million processed data points--reveal non-trivial insights. Our analysis uncovers the inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through lineage tracing, and maps the genealogical relationships across datasets. We release all results, tools, and configurations to democratize access to high-quality data evaluation. Rather than merely expanding a leaderboard, ODA envisions a shift from trial-and-error data curation to a principled science of Data-Centric AI, paving the way for rigorous studies on data mixing laws and the strategic composition of foundation models.
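The abstract's multi-dimensional scoring framework can be illustrated with a minimal sketch. The axis names below (`instruction_len`, `response_len`, `diversity`) are hypothetical placeholders for illustration only, not ODA's actual scoring schema.

```python
# Minimal sketch of multi-dimensional dataset scoring.
# Axis names are hypothetical, not ODA's actual schema.

def score_example(example: dict) -> dict:
    """Profile one instruction/response pair along toy quality axes."""
    instruction_tokens = example["instruction"].split()
    response_tokens = example["response"].split()
    return {
        "instruction_len": len(instruction_tokens),
        "response_len": len(response_tokens),
        # crude lexical-diversity proxy: unique-token ratio of the response
        "diversity": len(set(response_tokens)) / max(len(response_tokens), 1),
    }

def profile_dataset(dataset: list[dict]) -> dict:
    """Average each axis over the whole dataset to get its profile."""
    scores = [score_example(ex) for ex in dataset]
    return {ax: sum(s[ax] for s in scores) / len(scores) for ax in scores[0]}

data = [
    {"instruction": "Add 2 and 3", "response": "2 plus 3 equals 5"},
    {"instruction": "Name a prime number", "response": "7 is a prime number"},
]
profile = profile_dataset(data)
```

In this spirit, each dataset is reduced to a vector of per-axis scores, which is what makes cross-dataset comparison on a shared leaderboard possible.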
Related papers
- Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets [46.480867560675584]
We propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework using OpenDataArena (ODA). We instantiate this methodology through two new datasets: ODA-Math-460k, a specialized mathematics reasoning dataset that utilizes a novel two-stage difficulty-aware pipeline to achieve State-of-the-Art (SOTA) results on benchmarks such as AIME and HMMT, and ODA-Mixture (100k & 500k), a series of multi-domain instruction datasets built via an Anchor-and-
arXiv Detail & Related papers (2025-12-30T17:46:38Z) - Scaling Generalist Data-Analytic Agents [95.05161133349242]
DataMind is a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents.
arXiv Detail & Related papers (2025-09-29T17:23:08Z) - Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance [29.94723846950853]
We conduct the first comprehensive analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. We derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Our findings offer actionable insights for constructing more effective post-training datasets.
arXiv Detail & Related papers (2025-06-06T20:34:06Z) - DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is evaluated by human annotators and produces more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z) - Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
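The gradient-norm valuation metric mentioned above can be sketched as follows; examples whose per-sample loss gradient is larger are treated as more informative. The tiny logistic-regression setup and the top-k selection rule here are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

# Sketch of gradient-norm data valuation. The logistic-regression
# model is an illustrative stand-in, not the paper's actual setup.

def per_example_grad_norms(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """L2 norm of the per-example logistic-loss gradient w.r.t. w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))  # predicted probabilities
    residual = p - y                  # dL/dlogit for each example
    grads = residual[:, None] * X     # per-example gradient rows
    return np.linalg.norm(grads, axis=1)

def select_top_k(X: np.ndarray, y: np.ndarray, w: np.ndarray, k: int) -> np.ndarray:
    """Keep the k examples with the largest gradient norms (highest value)."""
    norms = per_example_grad_norms(X, y, w)
    return np.argsort(norms)[::-1][:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(3)
keep = select_top_k(X, y, w, k=4)
```

In an offline selection pipeline of the kind the paper describes, such per-example values would then feed a clustering or grouping step rather than a plain top-k cut.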
arXiv Detail & Related papers (2023-06-25T03:31:05Z) - Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP [43.7219097444333]
We introduce a testbed of six publicly available data sources to investigate how pre-training distributions induce robustness in CLIP.
We find that the performance of the pre-training data varies substantially across distribution shifts.
We find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source.
arXiv Detail & Related papers (2022-08-10T18:24:23Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.