CTE: A Dataset for Contextualized Table Extraction
- URL: http://arxiv.org/abs/2302.01451v1
- Date: Thu, 2 Feb 2023 22:38:23 GMT
- Title: CTE: A Dataset for Contextualized Table Extraction
- Authors: Andrea Gemelli, Emanuele Vivoli, Simone Marinai
- Abstract summary: The dataset comprises 75k fully annotated pages of scientific papers, including more than 35k tables.
Data are gathered from PubMed Central, merging the information provided by annotations in the PubTables-1M and PubLayNet datasets.
The generated annotations can be used to develop end-to-end pipelines for various tasks, including document layout analysis, table detection, structure recognition, and functional analysis.
- Score: 1.1859913430860336
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Relevant information in documents is often summarized in tables, helping the
reader to identify useful facts. Most benchmark datasets support either
document layout analysis or table understanding, but lack in providing data to
apply both tasks in a unified way. We define the task of Contextualized Table
Extraction (CTE), which aims to extract and define the structure of tables
considering the textual context of the document. The dataset comprises 75k
fully annotated pages of scientific papers, including more than 35k tables.
Data are gathered from PubMed Central, merging the information provided by
annotations in the PubTables-1M and PubLayNet datasets. The dataset can support
CTE and adds new classes to the original ones. The generated annotations can be
used to develop end-to-end pipelines for various tasks, including document
layout analysis, table detection, structure recognition, and functional
analysis. We formally define CTE and evaluation metrics, showing which subtasks
can be tackled, describing advantages, limitations, and future works of this
collection of data. Annotations and code will be accessible a
https://github.com/AILab-UniFI/cte-dataset.
Related papers
- UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition [55.153629718464565]
We introduce UniTabNet, a novel framework for table structure parsing based on the image-to-text model.
UniTabNet employs a divide-and-conquer'' strategy, utilizing an image-to-text model to decouple table cells and integrating both physical and logical decoders to reconstruct the complete table structure.
arXiv Detail & Related papers (2024-09-20T01:26:32Z) - QTSumm: Query-Focused Summarization over Tabular Data [58.62152746690958]
People primarily consult tables to conduct data analysis or answer specific questions.
We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning.
We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
arXiv Detail & Related papers (2023-05-23T17:43:51Z) - Doc2SoarGraph: Discrete Reasoning over Visually-Rich Table-Text
Documents via Semantic-Oriented Hierarchical Graphs [79.0426838808629]
We propose TAT-DQA, i.e. to answer the question over a visually-rich table-text document.
Specifically, we propose a novel Doc2SoarGraph framework with enhanced discrete reasoning capability.
We conduct extensive experiments on TAT-DQA dataset, and the results show that our proposed framework outperforms the best baseline model by 17.73% and 16.91% in terms of Exact Match (EM) and F1 score respectively on the test set.
arXiv Detail & Related papers (2023-05-03T07:30:32Z) - A large-scale dataset for end-to-end table recognition in the wild [13.717478398235055]
Table recognition (TR) is one of the research hotspots in pattern recognition.
Currently, the end-to-end TR in real scenarios, accomplishing the three sub-tasks simultaneously, is yet an unexplored research area.
We propose a new large-scale dataset named Table Recognition Set (TabRecSet) with diverse table forms sourcing from multiple scenarios in the wild.
arXiv Detail & Related papers (2023-03-27T02:48:51Z) - DocILE Benchmark for Document Information Localization and Extraction [7.944448547470927]
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition.
It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly1M unlabeled documents for unsupervised pre-training.
arXiv Detail & Related papers (2023-02-11T11:32:10Z) - Long Text and Multi-Table Summarization: Dataset and Method [20.90939310713561]
FINDSum is built on 21,125 annual reports from 3,794 companies.
It has two subsets for summarizing each company's results of operations and liquidity.
We propose a set of evaluation metrics to assess the usage of numerical information in produced summaries.
arXiv Detail & Related papers (2023-02-08T00:46:55Z) - A Graph Representation of Semi-structured Data for Web Question
Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z) - DART: Open-Domain Structured Data Record to Text Generation [91.23798751437835]
We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs)
We propose a procedure of extracting semantic triples from tables that encode their structures by exploiting the semantic dependencies among table headers and the table title.
Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and dialogue-act-based meaning representation tasks.
arXiv Detail & Related papers (2020-07-06T16:35:30Z) - ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.