Related papers: Schema-Driven Information Extraction from Heterogeneous Tables

Schema-Driven Information Extraction from Heterogeneous Tables

URL: http://arxiv.org/abs/2305.14336v5
Date: Wed, 20 Nov 2024 20:13:31 GMT
Title: Schema-Driven Information Extraction from Heterogeneous Tables
Authors: Fan Bai, Junmo Kang, Gabriel Stanovsky, Dayne Freitag, Mark Dredze, Alan Ritter,
Abstract summary: We present a benchmark comprised of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages. Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels.
Score: 37.50854811537401
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we explore the question of whether large language models can support cost-efficient information extraction from tables. We introduce schema-driven information extraction, a new task that transforms tabular data into structured records following a human-authored schema. To assess various LLM's capabilities on this task, we present a benchmark comprised of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages. We use this collection of annotated tables to evaluate the ability of open-source and API-based language models to extract information from tables covering diverse domains and data formats. Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels, achieving F1 scores ranging from 74.2 to 96.1, while maintaining cost efficiency. Moreover, through detailed ablation studies and analyses, we investigate the factors contributing to model success and validate the practicality of distilling compact models to reduce API reliance.

Related papers

Language Model Representations for Efficient Few-Shot Tabular Classification [17.63549220100997]
Large language models (LLMs) are becoming an increasingly integral component of web infrastructure for tasks like semantic search.<n>This work investigates a lightweight paradigm, $textbfTa$ble $textbfR$epresentation with $textbfL$anguage Model.<n>We show that our approach achieves performance comparable to state-of-the-art models in low-data regimes.
arXiv Detail & Related papers (2026-01-21T23:28:51Z)
Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [28.47810405584841]
Arranged and Organized Extraction Benchmark designed to evaluate ability of large language models to comprehend fragmented documents.<n>AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries.<n>Results show that even the most advanced models struggled significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z)
TalentMine: LLM-Based Extraction and Question-Answering from Multimodal Talent Tables [5.365164774382722]
We introduce TalentMine, a novel framework that transforms extracted tables into semantically enriched representations.<n> TalentMine achieves 100% accuracy in query answering tasks compared to 0% for standard AWS Textract extraction.<n>Our comparative analysis also reveals that the Claude v3 Haiku model achieves optimal performance for talent management applications.
arXiv Detail & Related papers (2025-06-22T22:17:42Z)
Taxonomy Inference for Tabular Data Using Large Language Models [31.121233193993906]
This paper presents two methods for taxonomy inference for tables: (i) EmTT which embeds columns by fine-tuning with contrastive learning encoder-alone LLMs like BERT, and (ii) GeTT which generates table entity types and their hierarchy by iterative prompting using a decoder-alone LLM like GPT-4.
arXiv Detail & Related papers (2025-03-25T16:26:05Z)
Better Think with Tables: Tabular Structures Enhance LLM Comprehension for Data-Analytics Requests [33.471112091886894]
Large Language Models (LLMs) often struggle with data-analytics requests related to information retrieval and data manipulation.<n>We introduce Thinking with Tables, where we inject tabular structures into LLMs for data-analytics requests.<n>We show that providing tables yields a 40.29 percent average performance gain along with better manipulation and token efficiency.
arXiv Detail & Related papers (2024-12-22T23:31:03Z)
SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction [1.0624606551524207]
Existing datasets often focus on scientific tables due to the vast amount of academic articles. Current datasets often lack the words, and their positions, contained within the tables. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables.
arXiv Detail & Related papers (2024-12-05T15:42:59Z)
ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models [58.34560740973768]
We introduce a framework that leverages language models (LMs) to generate literature review tables. A new dataset of 2,228 literature review tables extracted from ArXiv papers synthesize a total of 7,542 research papers. We evaluate LMs' abilities to reconstruct reference tables, finding this task benefits from additional context.
arXiv Detail & Related papers (2024-10-25T18:31:50Z)
Uncovering Limitations of Large Language Models in Information Seeking from Tables [28.19697259795014]
This paper introduces a more reliable benchmark for Table Information Seeking (TabIS) To avoid the unreliable evaluation caused by text similarity-based metrics, TabIS adopts a single-choice question format (with two options per question) instead of a text generation format.
arXiv Detail & Related papers (2024-06-06T14:30:59Z)
Wiki-TabNER:Advancing Table Interpretation Through Named Entity Recognition [19.423556742293762]
We analyse a widely used benchmark dataset for evaluation of TI tasks. To overcome this drawback, we construct and annotate a new more challenging dataset. We propose a prompting framework for evaluating the newly developed large language models.
arXiv Detail & Related papers (2024-03-07T15:22:07Z)
Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction. We reformulate the task to be entity-centric, enabling the use of diverse metrics. We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z)
TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning [55.33939289989238]
We propose TAP4LLM as a versatile pre-processor suite for leveraging large language models (LLMs) in table-based tasks effectively. It covers several distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmentation to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into various formats suitable for LLMs' understanding.
arXiv Detail & Related papers (2023-12-14T15:37:04Z)
HeLM: Highlighted Evidence augmented Language Model for Enhanced Table-to-Text Generation [7.69801337810352]
We conduct parameter-efficient fine-tuning on the LLaMA2 model. Our approach involves injecting reasoning information into the input by emphasizing table-specific row data. On both the FetaQA and QTSumm datasets, our approach achieved state-of-the-art results.
arXiv Detail & Related papers (2023-11-15T12:02:52Z)
All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction [39.05577374775964]
We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure. We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
arXiv Detail & Related papers (2023-11-14T14:22:47Z)
QTSumm: Query-Focused Summarization over Tabular Data [58.62152746690958]
People primarily consult tables to conduct data analysis or answer specific questions. We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning. We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
arXiv Detail & Related papers (2023-05-23T17:43:51Z)
A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations. Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.