TabLeX: A Benchmark Dataset for Structure and Content Information
Extraction from Scientific Tables
- URL: http://arxiv.org/abs/2105.06400v1
- Date: Wed, 12 May 2021 05:13:38 GMT
- Title: TabLeX: A Benchmark Dataset for Structure and Content Information
Extraction from Scientific Tables
- Authors: Harsh Desai, Pratik Kayal, Mayank Singh
- Abstract summary: This paper presents TabLeX, a large-scale benchmark dataset comprising table images generated from scientific articles.
To facilitate the development of robust table IE tools, TabLeX contains images in different aspect ratios and in a variety of fonts.
Our analysis sheds light on the shortcomings of current state-of-the-art table extraction models and shows that they fail on even simple table images.
- Score: 1.4115224153549193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Information Extraction (IE) from the tables present in scientific articles is
challenging due to complicated tabular representations and complex embedded
text. This paper presents TabLeX, a large-scale benchmark dataset comprising
table images generated from scientific articles. TabLeX consists of two
subsets, one for table structure extraction and the other for table content
extraction. Each table image is accompanied by its corresponding LaTeX source
code. To facilitate the development of robust table IE tools, TabLeX contains
images in different aspect ratios and in a variety of fonts. Our analysis sheds
light on the shortcomings of current state-of-the-art table extraction models
and shows that they fail on even simple table images. Towards the end, we
experiment with an existing transformer-based baseline to report performance
scores. In contrast to static benchmarks, we plan to augment this dataset
with more complex and diverse tables at regular intervals.
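To make the split between table structure and table content concrete, here is a minimal Python sketch that separates a LaTeX `tabular` source into structure tokens and cell text. The token set, the regular expression, and the function name `split_structure_and_content` are illustrative assumptions; the abstract does not say how TabLeX itself defines or tokenizes the two targets.

```python
from __future__ import annotations

import re

# Assumed structure tokens (hypothetical, not taken from the TabLeX paper):
# the tabular begin/end markers, \hline, the row terminator \\ and the
# column separator &.
STRUCTURE_TOKEN = re.compile(
    r"\\begin\{tabular\}\{[^}]*\}|\\end\{tabular\}|\\hline|\\\\|&"
)


def split_structure_and_content(latex: str) -> tuple[list[str], list[str]]:
    """Return (structure tokens, cell contents) for a simple tabular source."""
    # Every match of the pattern is treated as a structure token.
    structure = STRUCTURE_TOKEN.findall(latex)
    # Whatever lies between structure tokens is treated as cell content.
    pieces = STRUCTURE_TOKEN.split(latex)
    content = [piece.strip() for piece in pieces if piece.strip()]
    return structure, content


if __name__ == "__main__":
    source = (
        r"\begin{tabular}{lc} \hline Model & Acc \\ "
        r"\hline Ours & 0.9 \\ \hline \end{tabular}"
    )
    structure, content = split_structure_and_content(source)
    print(structure)
    print(content)
```

The sketch only handles flat tables; nested environments, `\multicolumn`, and math inside cells would need a real LaTeX tokenizer.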
Related papers
- UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition [55.153629718464565]
We introduce UniTabNet, a novel framework for table structure parsing based on the image-to-text model.
UniTabNet employs a "divide-and-conquer" strategy, utilizing an image-to-text model to decouple table cells and integrating both physical and logical decoders to reconstruct the complete table structure.
arXiv Detail & Related papers (2024-09-20T01:26:32Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- HYTREL: Hypergraph-enhanced Tabular Data Representation Learning [36.731257438472035]
HYTREL is a language model that captures the row/column permutation invariances and three more structural properties of tabular data.
We show that HYTREL consistently outperforms other competitive baselines on four downstream tasks with minimal pretraining.
Our qualitative analyses showcase that HYTREL can assimilate the table structures to generate robust representations for the cells, rows, columns, and the entire table.
arXiv Detail & Related papers (2023-07-14T05:41:22Z)
- Tables to LaTeX: structure and content extraction from scientific tables [0.848135258677752]
We adapt the transformer-based language modeling paradigm for scientific table structure and content extraction.
We achieve an exact match accuracy of 70.35% and 49.69% on table structure and content extraction, respectively.
arXiv Detail & Related papers (2022-10-31T12:08:39Z)
- Graph Neural Networks and Representation Embedding for Table Extraction in PDF Documents [1.1859913430860336]
The main contribution of this work is to tackle the problem of table extraction, exploiting Graph Neural Networks.
We experimentally evaluated the proposed approach on a new dataset obtained by merging the information provided in the PubLayNet and PubTables-1M datasets.
arXiv Detail & Related papers (2022-08-23T21:36:01Z)
- DiSCoMaT: Distantly Supervised Composition Extraction from Tables in Materials Science Articles [25.907266860321727]
We define a novel NLP task of extracting compositions of materials from tables in materials science papers.
We release a training dataset comprising 4,408 distantly supervised tables, along with 1,475 manually annotated dev and test tables.
We show that DiSCoMaT outperforms recent table processing architectures by significant margins.
arXiv Detail & Related papers (2022-07-03T17:11:17Z)
- Table Retrieval May Not Necessitate Table-specific Model Design [83.27735758203089]
We focus on the task of table retrieval, and ask: "is table-specific model design necessary for table retrieval?"
Based on an analysis of a table-based portion of the Natural Questions dataset (NQ-table), we find that structure plays a negligible role in more than 70% of the cases.
We then experiment with three modules to explicitly encode table structures, namely auxiliary row/column embeddings, hard attention masks, and soft relation-based attention biases.
None of these yielded significant improvements, suggesting that table-specific model design may not be necessary for table retrieval.
arXiv Detail & Related papers (2022-05-19T20:35:23Z)
- A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
- GFTE: Graph-based Financial Table Extraction [66.26206038522339]
In the financial industry and many other fields, tables are often disclosed in unstructured digital files, e.g. Portable Document Format (PDF) and images.
We publish a standard Chinese dataset named FinTab, which contains more than 1,600 financial tables of diverse kinds.
We propose a novel graph-based convolutional network model named GFTE as a baseline for future comparison.
arXiv Detail & Related papers (2020-03-17T07:10:05Z)
- Table Structure Extraction with Bi-directional Gated Recurrent Unit Networks [5.350788087718877]
This paper proposes a robust deep-learning-based approach to extract rows and columns from a detected table in document images with high precision.
We benchmarked our system on the publicly available UNLV and ICDAR 2013 datasets, on which it outperformed the state-of-the-art table structure extraction systems by a significant margin.
arXiv Detail & Related papers (2020-01-08T13:17:44Z)