Related papers: All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction

All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction

URL: http://arxiv.org/abs/2311.08189v3
Date: Mon, 18 Dec 2023 04:01:35 GMT
Title: All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction
Authors: Yuhan Li and Jian Wu and Zhiwei Yu and B\"orje F. Karlsson and Wei Shen and Manabu Okumura and Chin-Yew Lin
Abstract summary: We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure. We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
Score: 39.05577374775964
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Extracting key information from scientific papers has the potential to help researchers work more efficiently and accelerate the pace of scientific progress. Over the last few years, research on Scientific Information Extraction (SciIE) witnessed the release of several new systems and benchmarks. However, existing paper-focused datasets mostly focus only on specific parts of a manuscript (e.g., abstracts) and are single-modality (i.e., text- or table-only), due to complex processing and expensive annotations. Moreover, core information can be present in either text or tables or across both. To close this gap in data availability and enable cross-modality IE, while alleviating labeling costs, we propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure. Based on this pipeline, we release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline. We further report the performance of state-of-the-art IE models on the proposed benchmark dataset, as a baseline. Lastly, we explore the potential capability of large language models such as ChatGPT for the current task. Our new dataset, results, and analysis validate the effectiveness and efficiency of our semi-supervised pipeline, and we discuss its remaining limitations.

Related papers

MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [54.5729817345543]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic.<n>Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z)
Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z)
SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles. Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
MultiADE: A Multi-domain Benchmark for Adverse Drug Event Extraction [11.458594744457521]
Active adverse event surveillance monitors Adverse Drug Events (ADE) from different data sources. Most datasets or shared tasks focus on extracting ADEs from a particular type of text. Domain generalisation - the ability of a machine learning model to perform well on new, unseen domains (text types) - is under-explored. We build a benchmark for adverse drug event extraction, which we named MultiADE.
arXiv Detail & Related papers (2024-05-28T09:57:28Z)
Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction [36.915250638481986]
We introduce LiveSum, a new benchmark dataset for generating summary tables of competitions based on real-time commentary texts. We evaluate the performances of state-of-the-art Large Language Models on this task in both fine-tuning and zero-shot settings. We additionally propose a novel pipeline called $T3$(Text-Tuple-Table) to improve their performances.
arXiv Detail & Related papers (2024-04-22T14:31:28Z)
ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications [10.529898520273063]
ACLSum is a novel summarization dataset carefully crafted and evaluated by domain experts. In contrast to previous datasets, ACLSum facilitates multi-aspect summarization of scientific papers.
arXiv Detail & Related papers (2024-03-08T13:32:01Z)
Schema-Driven Information Extraction from Heterogeneous Tables [37.50854811537401]
We present a benchmark comprised of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages. Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels.
arXiv Detail & Related papers (2023-05-23T17:58:10Z)
Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images. We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities. The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers. We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain [11.085048329202335]
We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications. A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition. We present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set.
arXiv Detail & Related papers (2020-06-04T17:49:34Z)
SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level. We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.