All Data on the Table: Novel Dataset and Benchmark for Cross-Modality
Scientific Information Extraction
- URL: http://arxiv.org/abs/2311.08189v3
- Date: Mon, 18 Dec 2023 04:01:35 GMT
- Title: All Data on the Table: Novel Dataset and Benchmark for Cross-Modality
Scientific Information Extraction
- Authors: Yuhan Li and Jian Wu and Zhiwei Yu and B\"orje F. Karlsson and Wei
Shen and Manabu Okumura and Chin-Yew Lin
- Abstract summary: We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure.
We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
- Score: 39.05577374775964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Extracting key information from scientific papers has the potential to help
researchers work more efficiently and accelerate the pace of scientific
progress. Over the last few years, research on Scientific Information
Extraction (SciIE) witnessed the release of several new systems and benchmarks.
However, existing paper-focused datasets mostly focus only on specific parts of
a manuscript (e.g., abstracts) and are single-modality (i.e., text- or
table-only), due to complex processing and expensive annotations. Moreover,
core information can be present in either text or tables or across both. To
close this gap in data availability and enable cross-modality IE, while
alleviating labeling costs, we propose a semi-supervised pipeline for
annotating entities in text, as well as entities and relations in tables, in an
iterative procedure. Based on this pipeline, we release novel resources for the
scientific community, including a high-quality benchmark, a large-scale corpus,
and a semi-supervised annotation pipeline. We further report the performance of
state-of-the-art IE models on the proposed benchmark dataset, as a baseline.
Lastly, we explore the potential capability of large language models such as
ChatGPT for the current task. Our new dataset, results, and analysis validate
the effectiveness and efficiency of our semi-supervised pipeline, and we
discuss its remaining limitations.
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - MultiADE: A Multi-domain Benchmark for Adverse Drug Event Extraction [11.458594744457521]
Active adverse event surveillance monitors Adverse Drug Events (ADE) from different data sources.
One unanswered question is how far we are from having a single ADE extraction model that are effective on various types of text.
We contribute to answering this question by building a multi-domain benchmark for adverse drug event extraction, which we named MultiADE.
arXiv Detail & Related papers (2024-05-28T09:57:28Z) - Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction [36.915250638481986]
We introduce LiveSum, a new benchmark dataset for generating summary tables of competitions based on real-time commentary texts.
We evaluate the performances of state-of-the-art Large Language Models on this task in both fine-tuning and zero-shot settings.
We additionally propose a novel pipeline called $T3$(Text-Tuple-Table) to improve their performances.
arXiv Detail & Related papers (2024-04-22T14:31:28Z) - ACLSum: A New Dataset for Aspect-based Summarization of Scientific
Publications [10.529898520273063]
ACLSum is a novel summarization dataset carefully crafted and evaluated by domain experts.
In contrast to previous datasets, ACLSum facilitates multi-aspect summarization of scientific papers.
arXiv Detail & Related papers (2024-03-08T13:32:01Z) - Schema-Driven Information Extraction from Heterogeneous Tables [37.50854811537401]
We present a benchmark comprised of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages.
Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels.
arXiv Detail & Related papers (2023-05-23T17:58:10Z) - Modeling Entities as Semantic Points for Visual Information Extraction
in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z) - CitationIE: Leveraging the Citation Graph for Scientific Information
Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z) - The SOFC-Exp Corpus and Neural Approaches to Information Extraction in
the Materials Science Domain [11.085048329202335]
We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications.
A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition.
We present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set.
arXiv Detail & Related papers (2020-06-04T17:49:34Z) - SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.