Fugu-MT Paper Translation (Abstract): From Factoid Questions to Data Product Requests: Benchmarking Data Product Discovery over Tables and Text

Paper overview: From Factoid Questions to Data Product Requests: Benchmarking Data Product Discovery over Tables and Text

arxiv url: http://arxiv.org/abs/2510.21737v1
Date: Tue, 30 Sep 2025 23:07:36 GMT
Status: Translation complete
Last updated in system: 2025-11-03 05:35:45.942075
Title: From Factoid Questions to Data Product Requests: Benchmarking Data Product Discovery over Tables and Text
Title (reference translation): From Factoids to Data Product Requests: Benchmarking Data Product Discovery with Tables and Text
Authors: Liangliang Zhang, Nandana Mihindukulasooriya, Niharika S. D'Souza, Sola Shirai, Sarthak Dash, Yao Ma, Horst Samulowitz
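Abstract summary: DPBench is a user-request-driven data product benchmark over hybrid table-text corpora. The framework systematically repurposes existing table-text QA datasets by clustering related tables and passages into coherent data products.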
Reference score (independently computed attention score): 14.615452158253774
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data products are reusable, self-contained assets designed for specific business use cases. Automating their discovery and generation is of great industry interest, as it enables discovery in large data lakes and supports analytical Data Product Requests (DPRs). Currently, there is no benchmark established specifically for data product discovery. Existing datasets focus on answering single factoid questions over individual tables rather than collecting multiple data assets for broader, coherent products. To address this gap, we introduce DPBench, the first user-request-driven data product benchmark over hybrid table-text corpora. Our framework systematically repurposes existing table-text QA datasets by clustering related tables and passages into coherent data products, generating professional-level analytical requests that span both data sources, and validating benchmark quality through multi-LLM evaluation. DPBench preserves full provenance while producing actionable, analyst-like data product requests. Baseline experiments with hybrid retrieval methods establish the feasibility of DPR evaluation, reveal current limitations, and point to new opportunities for automatic data product discovery research. Code and datasets are available at: https://anonymous.4open.science/r/data-product-benchmark-BBA7/
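Abstract (reference translation): Data products are reusable, self-contained assets designed for specific business use cases. Automating their discovery and generation is of major interest to industry because it enables discovery in large data lakes and supports analytical Data Product Requests (DPRs). At present, no benchmark has been established specifically for data product discovery. Existing datasets focus on answering single factoid questions over individual tables rather than gathering multiple data assets into broader, coherent products. To address this gap, we introduce DPBench, a user-request-driven data product benchmark over hybrid table-text corpora. The framework systematically repurposes existing table-text QA datasets by clustering related tables and passages into coherent data products, generating professional-level analytical requests that span both data sources, and validating benchmark quality through multi-LLM evaluation. DPBench preserves full provenance while producing actionable, analyst-like data product requests. Baseline experiments with hybrid retrieval methods establish the feasibility of DPR evaluation, reveal current limitations, and point to new opportunities for automatic data product discovery research. Code and datasets are available at: https://anonymous.4open.science/r/data-product-benchmark-BBA7/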

Related papers