ACL-Fig: A Dataset for Scientific Figure Classification
- URL: http://arxiv.org/abs/2301.12293v1
- Date: Sat, 28 Jan 2023 20:27:35 GMT
- Title: ACL-Fig: A Dataset for Scientific Figure Classification
- Authors: Zeba Karishma, Shaurya Rohatgi, Kavya Shrinivas Puranik, Jian Wu, C. Lee Giles
- Abstract summary: We develop a pipeline that extracts figures and tables from the scientific literature and a deep-learning-based framework that classifies scientific figures using visual features.
We build the first large-scale automatically annotated corpus, ACL-Fig, consisting of 112,052 scientific figures extracted from 56K research papers in the ACL Anthology.
The ACL-Fig-Pilot dataset contains 1,671 manually labeled scientific figures belonging to 19 categories.
- Score: 15.241086410108512
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Most existing large-scale academic search engines are built to retrieve
text-based information. However, there are no large-scale retrieval services
for scientific figures and tables. One challenge for such services is
understanding scientific figures' semantics, such as their types and purposes.
A key obstacle is the need for datasets containing annotated scientific figures
and tables, which can then be used for classification, question-answering, and
auto-captioning. Here, we develop a pipeline that extracts figures and tables
from the scientific literature and a deep-learning-based framework that
classifies scientific figures using visual features. Using this pipeline, we
built the first large-scale automatically annotated corpus, ACL-Fig, consisting
of 112,052 scientific figures extracted from ~56K research papers in the ACL
Anthology. The ACL-Fig-Pilot dataset contains 1,671 manually labeled scientific
figures belonging to 19 categories. The dataset is accessible at
https://huggingface.co/datasets/citeseerx/ACL-fig under a CC BY-NC license.
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions [52.35520385083425]
We present SciDMT, an enhanced and expanded corpus for scientific mention detection.
The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mentions in the form of in-text spans, and 2) an evaluation set of 100 scientific articles that were manually annotated.
arXiv Detail & Related papers (2024-06-20T22:03:21Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- The ACL OCL Corpus: Advancing Open Science in Computational Linguistics [19.282407097200917]
The ACL OCL spans seven decades, containing 73K papers, alongside 210K figures.
By detecting paper topics with a supervised neural model, we note that interest in "Syntax: Tagging, Chunking and Parsing" is waning and "Natural Language Generation" is resurging.
arXiv Detail & Related papers (2023-05-24T10:35:56Z)
- S2abEL: A Dataset for Entity Linking from Scientific Tables [15.300960829210164]
We present the first dataset for entity linking in scientific tables.
Our dataset, S2abEL, focuses on EL in machine learning results tables.
We introduce a neural baseline method designed for EL on scientific tables containing many out-of-knowledge-base mentions.
arXiv Detail & Related papers (2023-04-30T02:07:22Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
- SciCap: Generating Captions for Scientific Figures [20.696070723932866]
We introduce SCICAP, a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020.
After pre-processing, SCICAP contained more than two million figures extracted from over 290,000 papers.
We established baseline models that caption graph plots, the dominant (19.2%) figure type.
arXiv Detail & Related papers (2021-10-22T07:10:41Z)
- TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics [32.4845534482475]
We present a new corpus that contains domain expert annotations for Task (T), Dataset (D), and Metric (M) entities on 2,000 sentences extracted from NLP papers.
We report experimental results on TDM extraction using a simple data augmentation strategy and apply our tagger to around 30,000 NLP papers from the ACL.
arXiv Detail & Related papers (2021-01-25T17:54:06Z)
- COVID-19 Knowledge Graph: Accelerating Information Retrieval and Discovery for Scientific Literature [23.279540233851993]
The coronavirus disease (COVID-19) has claimed the lives of over 350,000 people and infected more than 6 million people worldwide.
Several search engines have surfaced to provide researchers with additional tools to find and retrieve information from the rapidly growing corpora on COVID-19.
We present the COVID-19 Knowledge Graph (CKG), a heterogeneous graph for extracting and visualizing complex relationships between COVID-19 articles.
arXiv Detail & Related papers (2020-07-24T18:29:43Z)
- Informational Space of Meaning for Scientific Texts [68.8204255655161]
We introduce the Meaning Space, in which the meaning of a word is represented by a vector of Relative Information Gain (RIG) about the subject categories that the text belongs to.
This new approach is applied to construct the Meaning Space based on the Leicester Scientific Corpus (LSC) and the Leicester Scientific Dictionary-Core (LScDC).
The most informative words are presented for 252 categories. The proposed RIG-based model is shown to be able to make topic-specific words stand out within categories.
arXiv Detail & Related papers (2020-04-28T14:26:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.