DiSCoMaT: Distantly Supervised Composition Extraction from Tables in
Materials Science Articles
- URL: http://arxiv.org/abs/2207.01079v4
- Date: Sun, 28 Jan 2024 21:14:26 GMT
- Authors: Tanishq Gupta, Mohd Zaki, Devanshi Khatsuriya, Kausik Hira, N. M.
Anoop Krishnan, Mausam
- Abstract summary: We define a novel NLP task of extracting compositions of materials from tables in materials science papers.
We release a training dataset comprising 4,408 distantly supervised tables, along with 1,475 manually annotated dev and test tables.
We show that DISCOMAT outperforms recent table processing architectures by significant margins.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A crucial component in the curation of KB for a scientific domain (e.g.,
materials science, foods & nutrition, fuels) is information extraction from
tables in the domain's published research articles. To facilitate research in
this direction, we define a novel NLP task of extracting compositions of
materials (e.g., glasses) from tables in materials science papers. The task
involves solving several challenges in concert: tables that mention
compositions have highly varying structures; text in captions and the full paper
must be incorporated along with the data in tables; and regular languages for
numbers, chemical compounds, and composition expressions must be integrated into
the model. We release a training dataset comprising 4,408 distantly supervised
tables, along with 1,475 manually annotated dev and test tables. We also
present a strong baseline DISCOMAT, that combines multiple graph neural
networks with several task-specific regular expressions, features, and
constraints. We show that DISCOMAT outperforms recent table processing
architectures by significant margins.
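The abstract mentions integrating regular languages for numbers, chemical compounds, and composition expressions into the model. As a rough illustration of what such patterns might look like (these simplified regexes and the `parse_composition` helper are assumptions for illustration, not DiSCoMaT's actual rules):

```python
import re

# Illustrative patterns for the entity types the abstract mentions.
# These are deliberately simplified sketches, not the paper's regexes.
NUMBER = r"\d+(?:\.\d+)?"                 # e.g. 70, 12.5
COMPOUND = r"(?:[A-Z][a-z]?\d*)+"         # e.g. SiO2, Na2O, CaO
# A composition expression: amount+compound terms joined by '-' or '+'
COMPOSITION = rf"{NUMBER}\s*{COMPOUND}(?:\s*[-+]\s*{NUMBER}\s*{COMPOUND})*"

def parse_composition(text):
    """Split a glass composition string into (amount, compound) pairs."""
    pairs = re.findall(rf"({NUMBER})\s*({COMPOUND})", text)
    return [(float(amount), compound) for amount, compound in pairs]

print(parse_composition("70SiO2-20Na2O-10CaO"))
# [(70.0, 'SiO2'), (20.0, 'Na2O'), (10.0, 'CaO')]
```

In the paper's setting, features and constraints derived from matches like these would be combined with the graph neural networks rather than used as a standalone extractor.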
Related papers
- TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy [81.76462101465354]
We present a novel large vision-language model, TabPedia, equipped with a concept synergy mechanism.
This unified framework allows TabPedia to seamlessly integrate VTU tasks, such as table detection, table structure recognition, table querying, and table question answering.
To better evaluate the VTU task in real-world scenarios, we establish a new and comprehensive table VQA benchmark, ComTQA.
arXiv Detail & Related papers (2024-06-03T13:54:05Z) - Schema-Driven Information Extraction from Heterogeneous Tables [37.50854811537401]
We present a benchmark comprised of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages.
Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels.
arXiv Detail & Related papers (2023-05-23T17:58:10Z) - Tables to LaTeX: structure and content extraction from scientific tables [0.848135258677752]
We adapt the transformer-based language modeling paradigm for scientific table structure and content extraction.
We achieve exact match accuracies of 70.35% and 49.69% on table structure and content extraction, respectively.
arXiv Detail & Related papers (2022-10-31T12:08:39Z) - ReSel: N-ary Relation Extraction from Scientific Text and Tables by
Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z) - Graph Neural Networks and Representation Embedding for Table Extraction
in PDF Documents [1.1859913430860336]
The main contribution of this work is to tackle the problem of table extraction, exploiting Graph Neural Networks.
We experimentally evaluated the proposed approach on a new dataset obtained by merging the information provided in the PubLayNet and PubTables-1M datasets.
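Several entries in this list apply graph neural networks to tables. A common way to cast a table as a graph is to treat cells as nodes and connect cells that share a row or column; the minimal sketch below assumes that construction (it is a generic illustration, not any one paper's exact formulation):

```python
# Sketch: build a cell-adjacency graph from a table.
# Nodes are (row, col) cell positions; edges connect cells that
# share a row or a column. A GNN would then propagate cell features
# (text embeddings, position, formatting) along these edges.

def table_to_graph(table):
    """table: list of rows (lists of cell strings) -> (nodes, edges)."""
    nodes = {}      # (row, col) -> cell text
    edges = set()   # undirected pairs of node keys
    for r, row in enumerate(table):
        for c, cell in enumerate(row):
            nodes[(r, c)] = cell
    for a in nodes:
        for b in nodes:
            same_row, same_col = a[0] == b[0], a[1] == b[1]
            if a < b and (same_row or same_col):
                edges.add((a, b))
    return nodes, edges

table = [["Sample", "SiO2", "Na2O"],
         ["G1", "70", "30"]]
nodes, edges = table_to_graph(table)
print(len(nodes), len(edges))  # 6 9
```

For the 2x3 example, each row contributes 3 within-row edges and each column 1 within-column edge, giving 9 edges over 6 cells.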
arXiv Detail & Related papers (2022-08-23T21:36:01Z) - OmniTab: Pretraining with Natural and Synthetic Data for Few-shot
Table-based Question Answering [106.73213656603453]
We develop a simple table-based QA model with minimal annotation effort.
We propose an omnivorous pretraining approach that consumes both natural and synthetic data.
arXiv Detail & Related papers (2022-07-08T01:23:45Z) - Table Retrieval May Not Necessitate Table-specific Model Design [83.27735758203089]
We focus on the task of table retrieval, and ask: "is table-specific model design necessary for table retrieval?"
Based on an analysis of the table-based portion of the Natural Questions dataset (NQ-table), we find that structure plays a negligible role in more than 70% of the cases.
We then experiment with three modules to explicitly encode table structures, namely auxiliary row/column embeddings, hard attention masks, and soft relation-based attention biases.
None of these yielded significant improvements, suggesting that table-specific model design may not be necessary for table retrieval.
arXiv Detail & Related papers (2022-05-19T20:35:23Z) - TabLeX: A Benchmark Dataset for Structure and Content Information
Extraction from Scientific Tables [1.4115224153549193]
This paper presents TabLeX, a large-scale benchmark dataset comprising table images generated from scientific articles.
To facilitate the development of robust table IE tools, TabLeX contains images in different aspect ratios and in a variety of fonts.
Our analysis sheds light on the shortcomings of current state-of-the-art table extraction models and shows that they fail on even simple table images.
arXiv Detail & Related papers (2021-05-12T05:13:38Z) - A Graph Representation of Semi-structured Data for Web Question
Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z) - GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.