Scientific evidence extraction
- URL: http://arxiv.org/abs/2110.00061v1
- Date: Thu, 30 Sep 2021 19:42:07 GMT
- Title: Scientific evidence extraction
- Authors: Brandon Smock and Rohith Pesala and Robin Abraham
- Abstract summary: We propose a new dataset, PubMed Tables One Million (PubTables-1M), and a new class of metric, grid table similarity (GriTS).
PubTables-1M is nearly twice as large as the previous largest comparable dataset.
We show that object detection models trained on PubTables-1M produce excellent results out-of-the-box for all three tasks of detection, structure recognition, and functional analysis.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, interest has grown in applying machine learning to the problem of
table structure inference and extraction from unstructured documents. However,
progress in this area has been challenging both to make and to measure, due to
several issues that arise in training and evaluating models from labeled data.
This includes challenges as fundamental as the lack of a single definitive
ground truth output for each input sample and the lack of an ideal metric for
measuring partial correctness for this task. To address these, we propose a new
dataset, PubMed Tables One Million (PubTables-1M), and a new class of metric,
grid table similarity (GriTS). PubTables-1M is nearly twice as large as the
previous largest comparable dataset, can be used for models across multiple
architectures and modalities, and addresses issues such as ambiguity and lack
of consistency in the annotations. We apply DETR to table extraction for the
first time and show that object detection models trained on PubTables-1M
produce excellent results out-of-the-box for all three tasks of detection,
structure recognition, and functional analysis. We describe the dataset in
detail to enable others to build on our work and combine this data with other
datasets for these and related tasks. It is our hope that PubTables-1M and the
proposed metrics can further progress in this area by creating a benchmark
suitable for training and evaluating a wide variety of models for table
extraction. Data and code will be released at
https://github.com/microsoft/table-transformer.
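To make the proposed metric concrete: GriTS treats the predicted and ground-truth tables as two-dimensional grids of cells and scores the most similar substructure the two grids share, normalized F-measure style, so partial credit is possible. The Python sketch below is a simplified approximation under that reading of the abstract, not the authors' implementation (see the repository above for that): it nests two dynamic-programming alignments, cells within rows and then rows within tables, which does not enforce the single consistent column correspondence a true 2D alignment requires.

```python
from difflib import SequenceMatcher

def cell_sim(a: str, b: str) -> float:
    """Similarity of two cell strings in [0, 1] (a GriTS-Content-style signal)."""
    return SequenceMatcher(None, a, b).ratio()

def align(seq_a, seq_b, score):
    """Best total score of a monotone (order-preserving) alignment between two
    sequences; skipped items incur no penalty, as in LCS-style matching."""
    m, n = len(seq_a), len(seq_b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]))
    return dp[m][n]

def grid_similarity(pred, truth):
    """Align rows, scoring each row pair by an optimal alignment of its cells,
    then normalize: 2 * matched similarity / (|pred cells| + |truth cells|)."""
    total = align(pred, truth, lambda ra, rb: align(ra, rb, cell_sim))
    n_cells = sum(len(r) for r in pred) + sum(len(r) for r in truth)
    return 2 * total / n_cells if n_cells else 1.0

pred  = [["Name", "Age"], ["Alice", "30"], ["Bob", "31"]]
truth = [["Name", "Age"], ["Alice", "30"], ["Bob", "29"]]
print(round(grid_similarity(pred, truth), 3))  # 0.833: one cell disagrees
```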
Related papers
- TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes [25.169832192255956]
We present TabSketchFM, a neural tabular model for data discovery over data lakes.
We finetune the pretrained model for identifying unionable, joinable, and subset table pairs.
Our results demonstrate significant improvements in F1 scores for search compared to state-of-the-art techniques.
arXiv Detail & Related papers (2024-06-28T17:28:53Z)
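For readers unfamiliar with the sketch-based idea in the entry above: a classic sketch signal for joinability is a per-column MinHash signature, which estimates the value overlap of two columns without scanning full tables. The toy Python below illustrates only that general technique; it is not TabSketchFM's featurization, which feeds richer sketches into a pretrained transformer.

```python
import hashlib

def minhash(values, num_perm=64):
    """Toy MinHash signature: for each seeded hash function, keep the minimum
    hash observed over the column's distinct values."""
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{v}".encode()).digest()[:8], "big")
            for v in set(values))
        for seed in range(num_perm)
    ]

def jaccard_estimate(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard
    similarity of the underlying value sets -- a cheap joinability signal."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

col_a = ["us", "de", "fr", "jp", "br"]
col_b = ["us", "de", "fr", "cn", "in"]
print(jaccard_estimate(minhash(col_a), minhash(col_b)))  # close to 3/7
```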
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM), which encodes each instance by a fixed-size, distance-based meta-representation.
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
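A minimal sketch of the meta-representation idea in the TabPTM entry above, under the assumption that each instance is encoded by its distances to nearby training examples of each class; the payoff is a fixed-size encoding that is comparable across datasets with different raw feature counts. The function below is hypothetical, for illustration only.

```python
import numpy as np

def meta_representation(x, train_X, train_y, k=4):
    """Encode x by its sorted distances to the k nearest training examples of
    each class, giving a (num_classes * k)-dim, dataset-agnostic vector."""
    feats = []
    for c in np.unique(train_y):
        dists = np.sort(np.linalg.norm(train_X[train_y == c] - x, axis=1))
        feats.append(dists[:k])
    return np.concatenate(feats)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))            # 7 raw, dataset-specific features
y = (X[:, 0] > 0).astype(int)
print(meta_representation(X[0], X[1:], y[1:]).shape)  # (8,), independent of 7
```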
- Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks [2.690048852269647]
Our work is the first to study the advantages of a unified approach to table-specific pretraining when scaling sequence-to-sequence models from 770M to 11B parameters.
arXiv Detail & Related papers (2023-10-01T21:06:15Z)
- Retrieval-Based Transformer for Table Augmentation [14.460363647772745]
We introduce a novel approach toward automatic data wrangling.
We aim to address table augmentation tasks, including row/column population and data imputation.
Our model consistently and substantially outperforms both supervised statistical methods and the current state-of-the-art transformer-based models.
arXiv Detail & Related papers (2023-06-20T18:51:21Z)
- Towards End-to-End Semi-Supervised Table Detection with Deformable Transformer [11.648151981111436]
Table detection is the task of classifying and localizing table objects within document images.
Many semi-supervised approaches have been introduced to mitigate the need for a substantial amount of labeled data.
This paper presents a novel end-to-end semi-supervised table detection method that employs the deformable transformer for detecting table objects.
arXiv Detail & Related papers (2023-05-04T12:15:15Z)
- Parameter-Efficient Abstractive Question Answering over Tables or Text [60.86457030988444]
A long-term ambition of information seeking QA systems is to reason over multi-modal contexts and generate natural answers to user queries.
Memory intensive pre-trained language models are adapted to downstream tasks such as QA by fine-tuning the model on QA data in a specific modality like unstructured text or structured tables.
To avoid training such memory-hungry models while utilizing a uniform architecture for each modality, parameter-efficient adapters add and train small task-specific bottle-neck layers between transformer layers.
arXiv Detail & Related papers (2022-04-07T10:56:29Z)
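The bottleneck-adapter mechanism described in the entry above is standard enough to show directly. Below is a minimal PyTorch sketch of a generic Houlsby-style adapter; the class name and dimensions are illustrative assumptions, and the paper's exact adapter placement and sizes may differ.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, nonlinearity, up-project, residual connection. Only these
    small layers are trained per task; the backbone stays frozen."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(torch.relu(self.down(hidden)))

adapter = BottleneckAdapter(d_model=768)
h = torch.randn(2, 128, 768)             # (batch, sequence, hidden)
assert adapter(h).shape == h.shape
print(sum(p.numel() for p in adapter.parameters()))  # ~0.1M trainable params
```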
- Data Augmentation for Abstractive Query-Focused Multi-Document Summarization [129.96147867496205]
We present two QMDS training datasets, which we construct using two data augmentation methods.
These two datasets have complementary properties, i.e., QMDSCNN has real summaries but queries are simulated, while QMDSIR has real queries but simulated summaries.
We build end-to-end neural network models on the combined datasets that yield new state-of-the-art transfer results on DUC datasets.
arXiv Detail & Related papers (2021-03-02T16:57:01Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and our results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
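A minimal PyTorch sketch of the transductive update described in the entry above: each class prototype becomes a confidence-weighted mean over labeled support examples and unlabeled queries. Here `conf` merely stands in for the output of the meta-learned confidence network; all names and shapes are assumed for illustration.

```python
import torch

def soft_prototypes(support, support_y, queries, conf, num_classes):
    """Refine each class mean by folding in every query, weighted by a
    per-query, per-class confidence in [0, 1]."""
    protos = []
    for c in range(num_classes):
        s = support[support_y == c]            # labeled examples of class c
        w = conf[:, c].unsqueeze(-1)           # query weights for class c
        num = s.sum(0) + (w * queries).sum(0)
        den = s.size(0) + w.sum()
        protos.append(num / den)
    return torch.stack(protos)                 # (num_classes, feature_dim)

support, support_y = torch.randn(10, 32), torch.randint(0, 5, (10,))
queries = torch.randn(15, 32)
conf = torch.softmax(torch.randn(15, 5), dim=-1)   # stand-in confidences
print(soft_prototypes(support, support_y, queries, conf, 5).shape)  # (5, 32)
```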