Taxonomy Inference for Tabular Data Using Large Language Models
- URL: http://arxiv.org/abs/2503.21810v1
- Date: Tue, 25 Mar 2025 16:26:05 GMT
- Title: Taxonomy Inference for Tabular Data Using Large Language Models
- Authors: Zhenyu Wu, Jiaoyan Chen, Norman W. Paton
- Abstract summary: This paper presents two methods for taxonomy inference for tables: (i) EmTT, which embeds columns by fine-tuning encoder-alone LLMs such as BERT with contrastive learning, and (ii) GeTT, which generates table entity types and their hierarchy by iteratively prompting a decoder-alone LLM such as GPT-4.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Taxonomy inference for tabular data is a critical task of schema inference, aiming at discovering entity types (i.e., concepts) of the tables and building their hierarchy. It can play an important role in data management, data exploration, ontology learning, and many data-centric applications. Existing schema inference systems focus more on XML, JSON or RDF data, and often rely on lexical formats and structures of the data for calculating similarities, with limited exploitation of the semantics of the text across a table. Motivated by recent works on taxonomy completion and construction using Large Language Models (LLMs), this paper presents two LLM-based methods for taxonomy inference for tables: (i) EmTT, which embeds columns by fine-tuning encoder-alone LLMs such as BERT with contrastive learning and utilises clustering for hierarchy construction, and (ii) GeTT, which generates table entity types and their hierarchy by iteratively prompting a decoder-alone LLM such as GPT-4. Extensive evaluation on three real-world datasets with six metrics covering different aspects of the output taxonomies demonstrates that EmTT and GeTT both produce taxonomies with strong consistency relative to the ground truth.
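To make the EmTT idea concrete, here is a minimal sketch that skips the contrastive fine-tuning step: each column is serialized to text, embedded with an off-the-shelf encoder, and organized into a merge tree by agglomerative clustering. The serialization template, the all-MiniLM-L6-v2 encoder, and the two-cluster cut are illustrative assumptions, not the paper's actual configuration.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sentence_transformers import SentenceTransformer

def serialize_column(header, values, k=5):
    """Render a column as a short text snippet (hypothetical template)."""
    return f"{header}: " + ", ".join(str(v) for v in values[:k])

columns = {
    "film": ["Alien", "Heat", "Up"],
    "director": ["Ridley Scott", "Michael Mann", "Pete Docter"],
    "budget_usd": ["11000000", "60000000", "175000000"],
}

# Embed each serialized column; EmTT would first fine-tune the encoder
# with contrastive learning, which is omitted in this sketch.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([serialize_column(h, v) for h, v in columns.items()])

# Agglomerative clustering yields a binary merge tree over the columns;
# cutting it at different heights gives candidate taxonomy levels.
tree = linkage(emb, method="average", metric="cosine")
for name, label in zip(columns, fcluster(tree, t=2, criterion="maxclust")):
    print(name, "-> cluster", label)
```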
Related papers
- GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction
We propose GReaTER to generate realistic tabular data. GReaTER includes a data semantic enhancement system and a cross-table connecting method. Experimental results show that GReaTER outperforms the GReaT framework.
arXiv Detail & Related papers (2025-03-19T04:16:05Z)
- TableLoRA: Low-rank Adaptation on Table Structure Understanding for Large Language Models
TableLoRA is a module designed to improve LLMs' understanding of table structure during PEFT. It incorporates special tokens for serializing tables with a special-token encoder and uses 2D LoRA to encode low-rank information on cell positions.
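As a rough illustration of structure-aware serialization, the sketch below flattens a table into a token stream with structure markers; the tokens (<table>, <row>, <col>, <cell>) are made up for the example and are not TableLoRA's actual vocabulary.

```python
def serialize_table(header, rows):
    """Flatten a table into a token stream with structure markers."""
    parts = ["<table>", "<row>" + "".join(f"<col>{h}" for h in header)]
    for row in rows:
        parts.append("<row>" + "".join(f"<cell>{c}" for c in row))
    parts.append("</table>")
    return " ".join(parts)

print(serialize_table(["city", "population"],
                      [["Oslo", "0.7M"], ["Rome", "2.8M"]]))
```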
arXiv Detail & Related papers (2025-03-06T12:50:14Z)
- SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding
We present SRFUND, a hierarchically structured multi-task form understanding benchmark.
SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets.
The dataset covers eight languages: English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese.
arXiv Detail & Related papers (2024-06-13T02:35:55Z)
- TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning
We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively.
It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
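A toy version of the sampling and packing stages might look like the sketch below; the keyword-overlap relevance heuristic and the Markdown output are assumptions, as TAP4LLM itself selects sub-tables by query semantics and supports multiple serialization formats.

```python
def sample_rows(rows, query, k=2):
    """Keep the k rows with the most word overlap with the query (toy heuristic)."""
    terms = set(query.lower().split())
    scored = sorted(rows, key=lambda r: -len(terms & set(" ".join(r).lower().split())))
    return scored[:k]

def pack_markdown(header, rows):
    """Serialize a (sub-)table as a Markdown table for the LLM prompt."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in rows]
    return "\n".join(lines)

header = ["player", "team", "goals"]
rows = [["Kane", "Bayern", "36"], ["Salah", "Liverpool", "25"], ["Son", "Spurs", "17"]]
sub = sample_rows(rows, "How many goals did Kane score?")
print(pack_markdown(header, sub))
```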
arXiv Detail & Related papers (2023-12-14T15:37:04Z)
- HeLM: Highlighted Evidence augmented Language Model for Enhanced Table-to-Text Generation
We conduct parameter-efficient fine-tuning on the LLaMA2 model.
Our approach involves injecting reasoning information into the input by emphasizing table-specific row data.
On both the FetaQA and QTSumm datasets, our approach achieved state-of-the-art results.
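A minimal sketch of injecting highlighted row evidence into a table-to-text prompt, loosely in the spirit of HeLM; the [HIGHLIGHT] marker and prompt wording are invented for the example.

```python
def build_prompt(table_rows, highlighted, question):
    """Mark evidence rows so the model attends to them (illustrative format)."""
    lines = []
    for i, row in enumerate(table_rows):
        mark = "[HIGHLIGHT] " if i in highlighted else ""
        lines.append(mark + " | ".join(row))
    return "Table:\n" + "\n".join(lines) + f"\nQuestion: {question}\nAnswer:"

rows = [["2019", "12.4"], ["2020", "9.1"], ["2021", "14.8"]]
print(build_prompt(rows, highlighted={2}, question="What was the 2021 value?"))
```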
arXiv Detail & Related papers (2023-11-15T12:02:52Z)
- Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?
Struc-Bench is a comprehensive benchmark featuring prominent Large Language Models (LLMs).
We propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score).
Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains.
arXiv Detail & Related papers (2023-09-16T11:31:58Z)
- Schema-Driven Information Extraction from Heterogeneous Tables
We present a benchmark comprised of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages.
Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels.
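The schema-driven setup can be illustrated with a simple prompt that supplies the target attributes as a JSON schema and asks the model to fill them from a table; the schema, table, and prompt wording below are illustrative assumptions, not the benchmark's actual format.

```python
import json

# Hypothetical target schema for extracting results from an ML paper's table.
schema = {"model_name": "string", "dataset": "string", "accuracy": "number"}
table = "Model | Data | Acc\nBERT-base | SQuAD | 88.5"

prompt = (
    "Extract one JSON object matching this schema from the table.\n"
    f"Schema: {json.dumps(schema)}\n"
    f"Table:\n{table}\nJSON:"
)
print(prompt)  # send to any instruction-tuned LLM
```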
arXiv Detail & Related papers (2023-05-23T17:58:10Z)
- HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation
Hierarchical tables challenge existing methods by hierarchical indexing, as well as implicit relationships of calculation and semantics.
This work presents HiTab, a free and open dataset for the research community to study question answering (QA) and natural language generation (NLG) over hierarchical tables.
arXiv Detail & Related papers (2021-08-15T10:14:21Z)
- A Graph Representation of Semi-structured Data for Web Question Answering
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
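One plausible, simplified way to turn a table into such a graph is sketched below with row, column, and cell nodes; the paper's actual categorization of components and relations is richer.

```python
import networkx as nx

header = ["country", "capital"]
rows = [["France", "Paris"], ["Japan", "Tokyo"]]

g = nx.DiGraph()
for j, h in enumerate(header):
    g.add_node(("col", j), label=h)
for i, row in enumerate(rows):
    g.add_node(("row", i))
    for j, cell in enumerate(row):
        g.add_node(("cell", i, j), text=cell)
        g.add_edge(("row", i), ("cell", i, j))  # row membership
        g.add_edge(("col", j), ("cell", i, j))  # column membership

print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```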
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include a masked language modeling objective.
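A toy synchronous expansion in the spirit of this approach pairs each natural-language template with an aligned SQL template; the grammar below is invented for illustration and is far simpler than GraPPa's actual grammar.

```python
import random

COLUMNS = ["age", "salary"]  # assumed column names for the example
TABLE = "employees"          # assumed table name

def expand():
    """Expand one synchronous rule into an aligned (question, SQL) pair."""
    col = random.choice(COLUMNS)
    op_text, op_sql = random.choice([("greater than", ">"), ("less than", "<")])
    val = random.randint(20, 60)
    question = f"How many rows have {col} {op_text} {val}?"
    sql = f"SELECT COUNT(*) FROM {TABLE} WHERE {col} {op_sql} {val}"
    return question, sql

q, s = expand()
print(q)
print(s)
```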
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.