Something's Fishy In The Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks
- URL: http://arxiv.org/abs/2505.21329v2
- Date: Wed, 28 May 2025 11:44:41 GMT
- Title: Something's Fishy In The Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks
- Authors: Allaa Boutaleb, Bernd Amann, Hubert Naacke, Rafael Angarita
- Abstract summary: Recent table representation learning and data discovery methods tackle table union search (TUS) within data lakes. These methods are commonly evaluated using benchmarks that aim to assess semantic understanding in real-world TUS tasks. We propose essential criteria for future benchmarks to enable a more realistic and reliable evaluation of progress in semantic table union search.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent table representation learning and data discovery methods tackle table union search (TUS) within data lakes, which involves identifying tables that can be unioned with a given query table to enrich its content. These methods are commonly evaluated using benchmarks that aim to assess semantic understanding in real-world TUS tasks. However, our analysis of prominent TUS benchmarks reveals several limitations that allow simple baselines to perform surprisingly well, often outperforming more sophisticated approaches. This suggests that current benchmark scores are heavily influenced by dataset-specific characteristics and fail to effectively isolate the gains from semantic understanding. To address this, we propose essential criteria for future benchmarks to enable a more realistic and reliable evaluation of progress in semantic table union search.
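To make the "simple baselines perform surprisingly well" claim concrete, here is a minimal sketch of one naive, non-semantic TUS baseline: it ranks candidate tables by greedily matching columns on Jaccard overlap of their literal cell values. All names here (`union_score`, `search`, the table representation) are illustrative assumptions, not the paper's actual baselines.

```python
# Hypothetical naive TUS baseline: score candidate tables by best-match
# Jaccard overlap between column value sets. No learned semantics involved.
from typing import Dict, List, Set

# A table is represented as {column_name: set_of_normalized_cell_values}.
Table = Dict[str, Set[str]]


def column_jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two sets of cell values."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def union_score(query: Table, candidate: Table) -> float:
    """Average, over the query's columns, of the best-matching
    candidate column's Jaccard score."""
    if not query:
        return 0.0
    total = 0.0
    for q_values in query.values():
        total += max((column_jaccard(q_values, c_values)
                      for c_values in candidate.values()), default=0.0)
    return total / len(query)


def search(query: Table, lake: Dict[str, Table], k: int = 10) -> List[str]:
    """Return the names of the k highest-scoring tables in the lake."""
    ranked = sorted(lake, key=lambda name: union_score(query, lake[name]),
                    reverse=True)
    return ranked[:k]
```

On a benchmark whose unionable pairs share many literal cell values, a purely syntactic scorer like this can rival learned methods without any semantic understanding, which is the kind of dataset-specific confound the abstract describes.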
Related papers
- LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations.
This structured approach allows Large Language Models (LLMs) to generate and execute SQL queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z)
- TARGET: Benchmarking Table Retrieval for Generative Tasks [7.379012456053551]
TARGET is a benchmark for evaluating TAble Retrieval for GEnerative Tasks.
We analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream tasks.
We find that dense embedding-based retrievers far outperform a BM25 baseline, which is less effective on tables than it is for retrieval over unstructured text (a toy sketch of both retrieval styles appears after this list).
arXiv Detail & Related papers (2025-05-14T19:39:46Z)
- Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers.
We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers.
Our contributions focus on three key challenges encountered in real-world use: (i) user prompts are often under-specified; (ii) retrieved candidate papers frequently contain irrelevant content; and (iii) task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z)
- NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables [32.9031799179503]
NeedleInATable (NIAT) treats each table cell as a "needle" and requires models to extract the target cell based on cell locations or lookup questions.
Our data, code and models will be released to facilitate future research.
arXiv Detail & Related papers (2025-04-09T03:46:56Z)
- Bridging Queries and Tables through Entities in Table Retrieval [70.13748256886288]
Entities are well-studied in the context of text retrieval, but there is a noticeable lack of research on their applications in table retrieval.
We propose an entity-enhanced training framework and design an interaction paradigm based on entity representations.
Our proposed framework is plug-and-play and flexible, making it easy to integrate into existing table retriever training processes.
arXiv Detail & Related papers (2025-04-09T03:16:33Z)
- Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards [5.632231145349045]
This paper investigates transparency in the creation of benchmarks and in the use of leaderboards for measuring progress in NLP.
Existing relation extraction benchmarks often suffer from insufficient documentation and lack crucial details.
While our discussion centers on the transparency of RE benchmarks and leaderboards, the observations we discuss are broadly applicable to other NLP tasks as well.
arXiv Detail & Related papers (2024-11-07T22:36:19Z)
- Evaluating LLMs on Entity Disambiguation in Tables [0.9786690381850356]
This work proposes an extensive evaluation of four state-of-the-art Semantic Table Interpretation (STI) approaches: Alligator (formerly s-elbat), Dagobah, TURL, and TableLlama.
We also include GPT-4o and GPT-4o-mini in the evaluation, since they excel on various public benchmarks.
arXiv Detail & Related papers (2024-08-12T18:01:50Z)
- A Closer Look at Deep Learning Methods on Tabular Datasets [52.50778536274327]
Tabular data is prevalent across diverse domains in machine learning.
Deep Neural Network (DNN)-based methods have recently demonstrated promising performance.
We compare 32 state-of-the-art deep and tree-based methods, evaluating their average performance across multiple criteria.
arXiv Detail & Related papers (2024-07-01T04:24:07Z)
- TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks [30.922069185335246]
We find two common characteristics of tabular data in typical industrial applications that are underrepresented in the datasets usually used for evaluation in the literature.
A considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines.
This can have an impact on the absolute and relative number of predictive, uninformative, and correlated features compared to academic datasets.
arXiv Detail & Related papers (2024-06-27T17:55:31Z)
- Generative Benchmark Creation for Table Union Search [4.970364068620607]
We present a novel method for using generative models to create tables with specified properties.
We show that the new benchmark is more challenging for all methods than hand-curated benchmarks.
arXiv Detail & Related papers (2023-08-07T19:26:09Z)
- SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural network-based approaches (called SUN).
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflated evaluation scores caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
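As a toy companion to the TARGET entry above, the sketch below contrasts the two retrieval styles that entry compares: sparse BM25 scoring versus dense bi-encoder similarity, both over tables flattened to text. The corpus, serialization format, and model choice are illustrative assumptions rather than TARGET's actual setup, and it presumes the `rank_bm25` and `sentence-transformers` packages are installed.

```python
# Illustrative comparison of sparse (BM25) vs. dense (bi-encoder) retrieval
# over serialized tables. Not TARGET's actual pipeline.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

# Toy corpus: each table flattened to "title | header | sample cells".
tables = [
    "city stats | city population | Paris 2.1M Lyon 0.5M",
    "economies | country gdp | France 2.8T Germany 4.1T",
]
query = "population of French cities"

# Sparse: BM25 over whitespace-tokenized serializations.
bm25 = BM25Okapi([t.split() for t in tables])
bm25_scores = bm25.get_scores(query.split())

# Dense: cosine similarity between bi-encoder embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
dense_scores = util.cos_sim(model.encode(query), model.encode(tables))[0].tolist()

for table, s_sparse, s_dense in zip(tables, bm25_scores, dense_scores):
    print(f"bm25={s_sparse:6.2f}  dense={s_dense:5.2f}  :: {table}")
```

BM25 only sees exact token matches, which is one plausible reason it transfers poorly from unstructured text to terse, schema-heavy table serializations, consistent with the finding reported in the TARGET entry.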