Method and Dataset Entity Mining in Scientific Literature: A CNN +
Bi-LSTM Model with Self-attention
- URL: http://arxiv.org/abs/2010.13583v2
- Date: Thu, 28 Jan 2021 02:33:37 GMT
- Title: Method and Dataset Entity Mining in Scientific Literature: A CNN +
Bi-LSTM Model with Self-attention
- Authors: Linlin Hou, Ji Zhang, Ou Wu, Ting Yu, Zhen Wang, Zhao Li, Jianliang
Gao, Yingchun Ye, Rujing Yao
- Abstract summary: We propose a novel entity recognition model, called MDER, which is able to effectively extract the method and dataset entities from scientific papers.
We evaluate the proposed model on datasets constructed from the published papers of four research areas in computer science, i.e., NLP, CV, Data Mining and AI.
- Score: 21.93889297841459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Literature analysis helps researchers acquire a good understanding
of the development of science and technology. Traditional literature analysis
focuses largely on literature metadata such as topics, authors, abstracts,
keywords, and references, while little attention is paid to the main content of
papers. In many domains such as science, computing, and engineering, the
methods and datasets involved in published scientific papers carry important
information and are quite useful for domain analysis as well as algorithm and
dataset recommendation. In this paper, we propose a novel entity recognition
model, called MDER, which effectively extracts method and dataset entities from
the main textual content of scientific papers. The model utilizes rule
embedding and adopts a parallel structure of CNN and Bi-LSTM with a
self-attention mechanism (an illustrative sketch of this architecture follows
the abstract). We evaluate the proposed model on datasets constructed from the
published papers of four research areas in computer science, i.e., NLP, CV,
Data Mining, and AI. The experimental results demonstrate that our model
performs well in all four areas and features a good capacity for cross-area
learning and recognition. We also conduct experiments to evaluate the
effectiveness of the different building modules within our model; the results
indicate that these modules collectively contribute to the overall entity
recognition performance. The data augmentation experiments demonstrate that
data augmentation positively contributes to model training, making our model
much more robust in scenarios where only a small number of training samples are
available. We finally apply our model to PAKDD papers published from 2009 to
2019 to mine insightful results from scientific papers published over a longer
time span.
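The abstract describes the architecture only at a high level. The following is a minimal PyTorch sketch intended to illustrate the parallel CNN + Bi-LSTM structure with self-attention and a rule-embedding input; all layer names, dimensions, the per-token softmax head, and the way rule features are concatenated are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (assumption-based) of a parallel CNN + Bi-LSTM tagger with
# self-attention and rule embeddings, in the spirit of the MDER description.
# Hyperparameters and the tagging head are illustrative choices only.
import torch
import torch.nn as nn


class MDERSketch(nn.Module):
    def __init__(self, vocab_size, num_rules, num_tags,
                 word_dim=100, rule_dim=20, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        # Rule embedding: each token carries the id of a hand-crafted rule it matches.
        self.rule_emb = nn.Embedding(num_rules, rule_dim, padding_idx=0)
        in_dim = word_dim + rule_dim

        # CNN branch: local n-gram features over the token sequence.
        self.cnn = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)
        # Bi-LSTM branch: long-range sequential context.
        self.bilstm = nn.LSTM(in_dim, hidden // 2, batch_first=True,
                              bidirectional=True)
        # Self-attention over the concatenated outputs of the two parallel branches.
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=4,
                                          batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids, rule_ids):
        # token_ids, rule_ids: (batch, seq_len)
        x = torch.cat([self.word_emb(token_ids), self.rule_emb(rule_ids)], dim=-1)
        cnn_out = torch.relu(self.cnn(x.transpose(1, 2))).transpose(1, 2)
        lstm_out, _ = self.bilstm(x)
        feats = torch.cat([cnn_out, lstm_out], dim=-1)   # parallel-branch fusion
        attended, _ = self.attn(feats, feats, feats)     # self-attention
        return self.classifier(attended)                 # per-token tag logits


if __name__ == "__main__":
    model = MDERSketch(vocab_size=5000, num_rules=10, num_tags=5)
    tokens = torch.randint(1, 5000, (2, 30))
    rules = torch.randint(0, 10, (2, 30))
    print(model(tokens, rules).shape)  # torch.Size([2, 30, 5])
```

A CRF decoding layer, character-level features, or a different fusion of the two branches could be substituted; the sketch only shows the parallel-branch idea named in the abstract.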
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- Human-artificial intelligence teaming for scientific information extraction from data-driven additive manufacturing research using large language models [3.0061386772253784]
Data-driven research in Additive Manufacturing (AM) has achieved significant success in recent years.
This has led to the emergence of a large body of scientific literature.
Extracting scientific information from these works requires substantial effort and time.
We propose a framework that enables collaboration between AM and AI experts to continuously extract scientific information from data-driven AM literature.
arXiv Detail & Related papers (2024-07-26T15:43:52Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z)
- A Survey of Decomposition-Based Evolutionary Multi-Objective Optimization: Part II -- A Data Science Perspective [4.322038460697958]
We build a knowledge graph that encapsulates more than 5,400 papers, 10,000 authors, 400 venues, and 1,600 institutions for MOEA/D research.
We also explore the collaboration and citation networks of MOEA/D, uncovering hidden patterns in the growth of literature.
arXiv Detail & Related papers (2024-04-22T14:38:58Z)
- Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert handpicking documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
arXiv Detail & Related papers (2023-09-19T17:18:36Z)
- Enhancing Identification of Structure Function of Academic Articles Using Contextual Information [6.28532577139029]
This paper takes articles from the ACL conference as the corpus to identify the structure function of academic articles.
We employ traditional machine learning models and deep learning models to construct classifiers based on various feature inputs.
Inspired by (2), this paper introduces contextual information into the deep learning models and achieves significant results.
arXiv Detail & Related papers (2021-11-28T11:21:21Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We establish new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.