Related papers: Relationship Detection on Tabular Data Using Statistical Analysis and Large Language Models

URL: http://arxiv.org/abs/2506.06371v1
Date: Wed, 04 Jun 2025 12:11:05 GMT
Title: Relationship Detection on Tabular Data Using Statistical Analysis and Large Language Models
Authors: Panagiotis Koletsis, Christos Panagiotopoulos, Georgios Th. Papadopoulos, Vasilis Efthymiou
Abstract summary: This work experiments with a hybrid approach for detecting relationships using a Knowledge Graph (KG) as a reference point, a task known as CPA. This approach leverages large language models (LLMs) while employing statistical analysis to reduce the search space of potential KG relations. The experimental evaluation on two benchmark datasets provided by the SemTab challenge assesses the influence of each module and the effectiveness of different state-of-the-art LLMs.
Score: 4.201987249923826
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Over the past few years, table interpretation tasks have made significant progress due to their importance and the introduction of new technologies and benchmarks in the field. This work experiments with a hybrid approach for detecting relationships among columns of unlabeled tabular data, using a Knowledge Graph (KG) as a reference point, a task known as Column Property Annotation (CPA). This approach leverages large language models (LLMs) while employing statistical analysis to reduce the search space of potential KG relations. The main modules of this approach for reducing the search space are domain and range constraint detection and relation co-appearance analysis. The experimental evaluation on two benchmark datasets provided by the SemTab challenge assesses the influence of each module and the effectiveness of different state-of-the-art LLMs at various levels of quantization, as well as with different prompting techniques. The proposed methodology, which is publicly available on GitHub, proved to be competitive with state-of-the-art approaches on these datasets.
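
The idea of narrowing the candidate KG relations with domain and range constraints can be illustrated with a minimal sketch. This is not the authors' implementation: the relation names, the type labels, and the helper filter_by_domain_range are hypothetical, and the column types are assumed to be already known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str
    domain_type: str  # expected type of the subject column
    range_type: str   # expected type of the object column


# Hypothetical toy relations; a real KG would supply these.
CANDIDATES = [
    Relation("birthPlace", "Person", "Place"),
    Relation("author", "Book", "Person"),
    Relation("capital", "Country", "Place"),
]


def filter_by_domain_range(candidates, subject_type, object_type):
    """Keep only relations whose domain and range match the two column types."""
    return [
        r for r in candidates
        if r.domain_type == subject_type and r.range_type == object_type
    ]


# Only the shortlisted relations would then be offered to an LLM as options.
shortlist = filter_by_domain_range(CANDIDATES, "Person", "Place")
print([r.name for r in shortlist])  # prints ['birthPlace']
```

A smaller shortlist means a shorter prompt and fewer plausible-but-wrong options for the model to choose from, which is the motivation the abstract gives for the statistical pre-filtering.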

Related papers

DMCD: Semantic-Statistical Framework for Causal Discovery [0.03499870393443267]
We present DMCD, a causal discovery framework that integrates semantic drafting from variable metadata with statistical validation on observational data. We evaluate our approach on three metadata-rich real-world benchmarks spanning industrial engineering, environmental monitoring, and IT systems analysis.
arXiv Detail & Related papers (2026-02-23T20:29:35Z)
Evaluating LLMs on Entity Disambiguation in Tables [0.9786690381850356]
This work proposes an extensive evaluation of four STI SOTA approaches: Alligator (formerly s-elbat), Dagobah, TURL, and TableLlama. We also include in the evaluation both GPT-4o and GPT-4o-mini, since they excel in various public benchmarks.
arXiv Detail & Related papers (2024-08-12T18:01:50Z)
Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance. DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator. Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
Top-K Pairwise Ranking: Bridging the Gap Among Ranking-Based Measures for Multi-Label Classification [120.37051160567277]
This paper proposes a novel measure named Top-K Pairwise Ranking (TKPR). A series of analyses show that TKPR is compatible with existing ranking-based measures. On the other hand, we establish a sharp generalization bound for the proposed framework based on a novel technique named data-dependent contraction.
arXiv Detail & Related papers (2024-07-09T09:36:37Z)
Text Serialization and Their Relationship with the Conventional Paradigms of Tabular Machine Learning [0.0]
This study explores how Language Models (LMs) can be used for feature representation and prediction in machine learning tasks. Our study assesses how emerging LM technologies compare with traditional paradigms in tabular machine learning. Our findings reveal current pre-trained models should not replace conventional approaches.
arXiv Detail & Related papers (2024-06-19T21:19:37Z)
Interpetable Target-Feature Aggregation for Multi-Task Learning based on Bias-Variance Analysis [53.38518232934096]
Multi-task learning (MTL) is a powerful machine learning paradigm designed to leverage shared knowledge across tasks to improve generalization and performance. We propose an MTL approach at the intersection between task clustering and feature transformation based on a two-phase iterative aggregation of targets and features. In both phases, a key aspect is to preserve the interpretability of the reduced targets and features through the aggregation with the mean, which is motivated by applications to Earth science.
arXiv Detail & Related papers (2024-06-12T08:30:16Z)
Wiki-TabNER: Integrating Named Entity Recognition into Wikipedia Tables [18.330753799139845]
A new dataset, Wiki-TabNER, is proposed to enrich the existing benchmark datasets. This paper describes the distinguishing features of the Wiki-TabNER dataset and the labeling process. In addition, we propose a prompting framework for evaluating the new large language models on the within-table NER task.
arXiv Detail & Related papers (2024-03-07T15:22:07Z)
Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs). Our proposed method first trains SOMs on unlabeled data, and then a minimal number of available labeled data points are assigned to key best matching units (BMUs). Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
Statistical Inference with Limited Memory: A Survey [22.41443027099101]
We review the state-of-the-art of statistical inference under memory constraints in several canonical problems. We discuss the main results in this developing field, and by identifying recurrent themes, we extract some fundamental building blocks for algorithmic construction.
arXiv Detail & Related papers (2023-12-23T11:14:33Z)
A Novel Energy based Model Mechanism for Multi-modal Aspect-Based Sentiment Analysis [85.77557381023617]
We propose a novel framework called DQPSA for multi-modal sentiment analysis. The PDQ module uses the prompt as both a visual query and a language query to extract prompt-aware visual information. The EPE module models the boundary pairing of the analysis target from the perspective of an energy-based model.
arXiv Detail & Related papers (2023-12-13T12:00:46Z)
A Bayesian Methodology for Estimation for Sparse Canonical Correlation [0.0]
Canonical Correlation Analysis (CCA) is a statistical procedure for identifying relationships between data sets. ScSCCA is a rapidly emerging methodological area that aims for robust modeling of the interrelations between the different data modalities. We propose a novel ScSCCA approach where we employ a Bayesian infinite factor model and aim to achieve robust estimation.
arXiv Detail & Related papers (2023-10-30T15:14:25Z)
Joint Distributional Learning via Cramer-Wold Distance [0.7614628596146602]
We introduce the Cramer-Wold distance regularization, which can be computed in closed form, to facilitate joint distributional learning for high-dimensional datasets. We also introduce a two-step learning method to enable flexible prior modeling and improve the alignment between the aggregated posterior and the prior distribution.
arXiv Detail & Related papers (2023-10-25T05:24:23Z)
Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
Knowledge Graph Embedding Methods for Entity Alignment: An Experimental Review [7.241438112282638]
We conduct the first meta-level analysis of popular embedding methods for entity alignment. Our analysis reveals statistically significant correlations of different embedding methods with various meta-features extracted by KGs. We rank them in a statistically significant way according to their effectiveness across all real-world KGs of our testbed.
arXiv Detail & Related papers (2022-03-17T12:11:58Z)
Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task. The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them. By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.