DP-Bench: A Benchmark for Evaluating Data Product Creation Systems
- URL: http://arxiv.org/abs/2512.15798v1
- Date: Tue, 16 Dec 2025 19:19:01 GMT
- Title: DP-Bench: A Benchmark for Evaluating Data Product Creation Systems
- Authors: Faisal Chowdhury, Sola Shirai, Sarthak Dash, Nandana Mihindukulasooriya, Horst Samulowitz,
- Abstract summary: DP-Bench is a benchmark to evaluate automatic data product creation.
We describe how this benchmark was created by taking advantage of existing work in ELT and Text-to-SQL benchmarks.
We propose a number of approaches that can be considered as baselines for generating data products automatically.
- Score: 6.79084373554523
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A data product is created with the intention of solving a specific problem, addressing a specific business use case, or meeting a particular need, going beyond just serving data as a raw asset. Data products enable end users to gain greater insights about their data. Since the concept was first introduced over a decade ago, there has been considerable work, especially in industry, to create data products manually or semi-automatically. However, hardly any benchmarks exist to evaluate automatic data product creation. In this work, we present a benchmark, the first of its kind, for this task. We call it DP-Bench. We describe how this benchmark was created by taking advantage of existing work in ELT (Extract-Load-Transform) and Text-to-SQL benchmarks. We also propose a number of LLM-based approaches that can be considered as baselines for generating data products automatically. We make DP-Bench and supplementary materials available at https://huggingface.co/datasets/ibm-research/dp-bench .
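As a quick-start illustration, the sketch below loads the published dataset with the Hugging Face datasets library. Only the repository id comes from the abstract; the split names and record schema are not specified there, so the code inspects them at runtime rather than assuming them.

```python
# Minimal sketch: loading DP-Bench from the Hugging Face Hub.
# The repository id is taken from the paper's abstract; everything
# else (splits, fields) is discovered at runtime, not assumed.
from datasets import load_dataset

dp_bench = load_dataset("ibm-research/dp-bench")  # downloads all available splits

print(dp_bench)                                   # shows split names and features
first_split = next(iter(dp_bench.values()))
print(first_split[0])                             # inspect one record's actual schema
```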
Related papers
- From Factoid Questions to Data Product Requests: Benchmarking Data Product Discovery over Tables and Text [14.615452158253774]
DPBench is the first user-request-driven data product benchmark over hybrid table-text corpora.
Our framework systematically repurposes existing table-text QA datasets by clustering related tables and passages into coherent data products.
arXiv Detail & Related papers (2025-09-30T23:07:36Z)
- EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association [83.4879773429742]
This paper defines the task of E-commerce Script Planning (EcomScript) as three sequential subtasks.
We propose a novel framework that enables the scalable generation of product-enriched scripts by associating products with each step.
We construct the very first large-scale EcomScript dataset, EcomScriptBench, which includes 605,229 scripts sourced from 2.4 million products.
arXiv Detail & Related papers (2025-05-21T07:21:38Z)
- Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana [15.898927916560892]
DataMorgana is a tool for generating highly customizable and diverse synthetic Q&A benchmarks tailored to RAG applications.
It enables detailed configurations of user and question categories and provides control over their distribution within the benchmark.
DataMorgana will be made available to selected teams in the research community, as the first beta testers, in the context of the upcoming SIGIR'2025 LiveRAG challenge.
arXiv Detail & Related papers (2025-01-22T10:47:08Z)
- Self-Refinement Strategies for LLM-based Product Attribute Value Extraction [51.45146101802871]
This paper investigates applying two self-refinement techniques to the product attribute value extraction task.
The experiments show that both self-refinement techniques fail to significantly improve the extraction performance while substantially increasing processing costs.
For scenarios with development data, fine-tuning yields the highest performance, while the ramp-up costs of fine-tuning are balanced out as the number of product descriptions increases.
arXiv Detail & Related papers (2025-01-02T12:55:27Z)
- Mind the Data Gap: Bridging LLMs to Enterprise Data Integration [2.7248990920379725]
We show that the performance of methods based on large language models (LLMs) seriously degrades when tested on real-world datasets.
We release a new benchmark dataset, the GOBY Benchmark, to advance discovery in enterprise data integration.
arXiv Detail & Related papers (2024-12-29T03:07:20Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios [51.66718740300016]
TableLLM is a robust large language model (LLM) with 8 billion parameters.
TableLLM is purpose-built for proficiently handling data manipulation tasks.
We have released the model checkpoint, source code, benchmarks, and a web application for user interaction.
arXiv Detail & Related papers (2024-03-28T11:21:12Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
The Generation, Evaluation, and Metrics (GEM) Benchmark introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
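To make the "single line of code" claim in the GEMv2 entry above concrete, here is a minimal sketch using the Hugging Face datasets library. The GEM/common_gen dataset id is an assumed example under the GEM namespace on the Hub and is not taken from the entry itself.

```python
# Minimal sketch of GEMv2-style loading: one line fetches a benchmark
# dataset. "GEM/common_gen" is an assumed example id, not confirmed by
# the entry above; GEMv2 documents 40 datasets in 51 languages.
from datasets import load_dataset

gem_data = load_dataset("GEM/common_gen")  # the advertised single line
print(gem_data)  # shows the splits and features that were downloaded
```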
This list is automatically generated from the titles and abstracts of the papers on this site.