Schema Inference for Tabular Data Repositories Using Large Language Models
- URL: http://arxiv.org/abs/2509.04632v1
- Date: Thu, 04 Sep 2025 19:50:16 GMT
- Title: Schema Inference for Tabular Data Repositories Using Large Language Models
- Authors: Zhenyu Wu, Jiaoyan Chen, Norman W. Paton
- Abstract summary: We present SI-LLM, which infers a concise conceptual schema for tabular data using only column headers and cell values. SI-LLM achieves promising end-to-end results, as well as better or comparable results to state-of-the-art methods at each step.
- Score: 12.626848016550051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Minimally curated tabular data often contain representational inconsistencies across heterogeneous sources, and are accompanied by sparse metadata. Working with such data is challenging. While prior work has advanced dataset discovery and exploration, schema inference remains difficult when metadata are limited. We present SI-LLM (Schema Inference using Large Language Models), which infers a concise conceptual schema for tabular data using only column headers and cell values. The inferred schema comprises hierarchical entity types, attributes, and inter-type relationships. In extensive evaluation on two datasets from web tables and open data, SI-LLM achieves promising end-to-end results, as well as better or comparable results to state-of-the-art methods at each step. All source code, full prompts, and datasets of SI-LLM are available at https://github.com/PierreWoL/SILLM.
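Since the abstract states that inference uses only column headers and cell values, the input-construction step can be pictured as serializing each table into a compact header/value-sample prompt. The sketch below is illustrative only; the function name, prompt wording, and sampling strategy are assumptions, not taken from the paper.

```python
import random

def build_schema_prompt(table_name, headers, rows, n_samples=3, seed=0):
    """Serialize a table into a prompt that asks an LLM for an entity type,
    attributes, and relationships (hypothetical format, not SI-LLM's own)."""
    rng = random.Random(seed)
    lines = [f"Table: {table_name}"]
    for i, header in enumerate(headers):
        values = [row[i] for row in rows if i < len(row)]
        sample = rng.sample(values, min(n_samples, len(values)))
        lines.append(f"- {header}: e.g. {', '.join(map(str, sample))}")
    lines.append("Infer a conceptual entity type for this table, "
                 "its attributes, and any relationships to other types.")
    return "\n".join(lines)

prompt = build_schema_prompt(
    "players",
    ["name", "club", "goals"],
    [["Messi", "Inter Miami", 20], ["Haaland", "Man City", 27]],
    n_samples=2,
)
print(prompt)
```

Sampling a few cell values per column keeps the prompt short while still giving the model evidence about each column's domain.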
Related papers
- Table as a Modality for Large Language Models [28.392792653645998]
We show a probing experiment on our proposed StructQA benchmark. We propose TAMO, which treats tables as an independent modality integrated with the text tokens.
arXiv Detail & Related papers (2025-11-30T15:59:56Z)
- Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
The Arranged and Organized Extraction (AOE) benchmark is designed to evaluate the ability of large language models to comprehend fragmented documents. AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. Results show that even the most advanced models struggle significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z)
- Plugging Schema Graph into Multi-Table QA: A Human-Guided Framework for Reducing LLM Reliance [8.83042313837811]
We propose a graph-based framework that leverages human-curated relational knowledge to explicitly encode schema links and join paths. Given a natural language query, our method searches the graph to construct interpretable reasoning chains, aided by pruning and sub-path merging strategies. Experiments on both standard benchmarks and a realistic, large-scale dataset demonstrate the effectiveness of our approach.
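The core idea above, tables as nodes and curated foreign-key links as edges, makes a join path a shortest path in the schema graph. A minimal sketch, with an invented toy schema (the real framework's pruning and sub-path merging are not modeled here):

```python
from collections import deque

def join_path(schema_graph, src, dst):
    """Breadth-first search for a shortest table-to-table join path.
    Returns a list of table names, or None if no path exists."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in schema_graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy schema: edges are human-curated foreign-key relationships.
graph = {
    "orders": ["customers", "order_items"],
    "order_items": ["orders", "products"],
    "customers": ["orders"],
    "products": ["order_items"],
}
print(join_path(graph, "customers", "products"))
# ['customers', 'orders', 'order_items', 'products']
```

Because the join paths come from the curated graph rather than from the LLM, the resulting reasoning chain is interpretable and checkable by a human.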
arXiv Detail & Related papers (2025-06-04T20:21:52Z)
- MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [54.5729817345543]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic. Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z)
- UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification [50.59009084277447]
We introduce UNJOIN, a framework that decouples the retrieval of schema elements from logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. In the second stage, the query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic.
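The first-stage merge described above is simple enough to sketch directly: every column is prefixed with its table name so the whole database reads as one virtual table. The dot separator and the toy schema below are assumptions for illustration.

```python
def merge_schema(tables):
    """Flatten a multi-table schema into one virtual table by prefixing
    each column with its table name (UNJOIN's first stage, schematically)."""
    return [f"{table}.{col}" for table, cols in tables.items() for col in cols]

tables = {"users": ["id", "name"], "orders": ["id", "user_id", "total"]}
print(merge_schema(tables))
# ['users.id', 'users.name', 'orders.id', 'orders.user_id', 'orders.total']
```

The prefix keeps same-named columns (both `id` columns here) distinguishable, which is what lets the second stage map the generated query back to the original tables.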
arXiv Detail & Related papers (2025-05-23T17:28:43Z)
- Taxonomy Inference for Tabular Data Using Large Language Models [31.121233193993906]
This paper presents two methods for taxonomy inference for tables: (i) EmTT, which embeds columns by fine-tuning encoder-only LLMs such as BERT with contrastive learning, and (ii) GeTT, which generates table entity types and their hierarchy by iteratively prompting a decoder-only LLM such as GPT-4.
arXiv Detail & Related papers (2025-03-25T16:26:05Z)
- GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction [9.784347635082232]
We propose GReaTER to generate realistic tabular data. GReaTER includes a data semantic enhancement system and a cross-table connecting method. Experimental results show that GReaTER outperforms the GReaT framework.
arXiv Detail & Related papers (2025-03-19T04:16:05Z)
- LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion [49.898152180805454]
Synthetic datasets must maintain domain-specific logical consistency. Existing generative models often overlook these inter-column relationships. This study presents the first method to effectively preserve inter-column relationships without requiring domain knowledge.
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
- Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprising candidate generation, refinement, and confidence scoring.
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
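The three-stage pipeline named in the abstract can be sketched as a small composition of functions. Here a plain string-similarity score stands in for the LLM calls Matchmaker actually makes, and all column names are invented for illustration.

```python
from difflib import SequenceMatcher

def candidates(source_col, target_cols, k=3):
    """Candidate generation: top-k targets by a cheap similarity score."""
    return sorted(target_cols,
                  key=lambda t: SequenceMatcher(None, source_col, t).ratio(),
                  reverse=True)[:k]

def confidence(source_col, target_col):
    """Confidence scoring stand-in (Matchmaker uses LLM-based scoring)."""
    return SequenceMatcher(None, source_col, target_col).ratio()

def match(source_col, target_cols, threshold=0.5):
    """Refinement: pick the highest-confidence candidate above a threshold."""
    best = max(candidates(source_col, target_cols),
               key=lambda t: confidence(source_col, t))
    score = confidence(source_col, best)
    return (best, score) if score >= threshold else (None, score)

print(match("patient_dob", ["date_of_birth", "patient_birth_date", "gender"]))
```

Separating generation, refinement, and scoring is what lets each stage be improved (or self-improved) independently.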
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
- ReMatch: Retrieval Enhanced Schema Matching with LLMs [0.874967598360817]
We present a novel method, named ReMatch, for matching schemas using retrieval-enhanced Large Language Models (LLMs).
Our experimental results on large real-world schemas demonstrate that ReMatch is an effective matcher.
arXiv Detail & Related papers (2024-03-03T17:14:40Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation learning approach for data laid out across multiple tables. Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all input data.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers [0.0]
We introduce REaLTabFormer (Realistic and Tabular Transformer), a synthetic data generation model.
It first creates a parent table using an autoregressive GPT-2 model, then generates the relational dataset conditioned on the parent table using a sequence-to-sequence model.
Experiments using real-world datasets show that REaLTabFormer captures the relational structure better than a baseline model.
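The parent-then-child generation order described above can be illustrated with toy samplers in place of the GPT-2 and sequence-to-sequence models: the child sampler conditions on the parent row, so referential integrity holds by construction. Field names and distributions below are invented for illustration.

```python
import random

def sample_parent(rng):
    """Stand-in for the autoregressive parent-table model."""
    return {"customer_id": rng.randint(1, 100),
            "segment": rng.choice(["retail", "business"])}

def sample_children(parent, rng, max_children=3):
    """Stand-in for the seq2seq model conditioned on the parent row;
    the foreign key is copied from the parent, never re-sampled."""
    n = rng.randint(1, max_children)
    return [{"customer_id": parent["customer_id"],
             "amount": round(rng.uniform(5, 500), 2)}
            for _ in range(n)]

rng = random.Random(42)
parent = sample_parent(rng)
children = sample_children(parent, rng)
assert all(c["customer_id"] == parent["customer_id"] for c in children)
print(parent, children)
```

Generating the parent first and conditioning the children on it is what lets the relational (one-to-many) structure survive in the synthetic data.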
arXiv Detail & Related papers (2023-02-04T00:32:50Z)
- TabEAno: Table to Knowledge Graph Entity Annotation [7.451544182579802]
We propose a novel approach, namely TabEAno, to semantically annotate table rows toward knowledge graph entities.
We introduce a "two-cells" lookup strategy based on the assumption that a logical relation exists in the knowledge graph between two close cells in the same row of the table.
Despite the simplicity of the approach, TabEAno outperforms state-of-the-art approaches on two standard datasets.
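The two-cells assumption can be sketched as a filter: a candidate entity for one cell is kept only if the knowledge graph relates it to the neighbouring cell in the same row. The triples and relation names below are invented for illustration.

```python
# Tiny stand-in knowledge graph of (subject, predicate, object) triples.
KG = {
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "located_in", "Texas"),  # an ambiguous "Paris"
}

def annotate(cell, neighbour):
    """Keep (entity, relation) readings of `cell` that the KG links
    to the neighbouring cell in the same row."""
    return sorted((s, p) for (s, p, o) in KG if s == cell and o == neighbour)

print(annotate("Paris", "France"))  # [('Paris', 'capital_of')]
print(annotate("Paris", "Texas"))   # [('Paris', 'located_in')]
```

The neighbouring cell disambiguates: the same surface form "Paris" resolves differently depending on what sits next to it in the row.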
arXiv Detail & Related papers (2020-10-05T07:39:02Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
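A synchronous grammar expands each nonterminal into an aligned pair, so the natural-language side and the SQL side stay in sync by construction. The miniature grammar below is a much-simplified illustration of the idea, not GraPPa's actual rule set.

```python
import random

# Each rule maps a nonterminal to aligned (question, SQL) expansions.
RULES = {
    "QUERY": [("what is the {AGG} {COL} in {TABLE}?",
               "SELECT {AGG_SQL}({COL}) FROM {TABLE}")],
    "AGG": [("maximum", "MAX"), ("minimum", "MIN"), ("average", "AVG")],
}

def generate(table, col, rng):
    """Expand the grammar once, keeping both sides aligned."""
    q_tpl, sql_tpl = RULES["QUERY"][0]
    agg_nl, agg_sql = rng.choice(RULES["AGG"])  # one synchronized choice
    question = q_tpl.format(AGG=agg_nl, COL=col, TABLE=table)
    sql = sql_tpl.format(AGG_SQL=agg_sql, COL=col, TABLE=table)
    return question, sql

print(generate("employees", "salary", random.Random(0)))
```

Because one random choice fills both templates, every generated question is guaranteed to have a matching SQL query, which is what makes such pairs usable as pre-training data.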
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.