Method and Dataset Entity Mining in Scientific Literature: A CNN +
Bi-LSTM Model with Self-attention
- URL: http://arxiv.org/abs/2010.13583v2
- Date: Thu, 28 Jan 2021 02:33:37 GMT
- Title: Method and Dataset Entity Mining in Scientific Literature: A CNN +
Bi-LSTM Model with Self-attention
- Authors: Linlin Hou, Ji Zhang, Ou Wu, Ting Yu, Zhen Wang, Zhao Li, Jianliang
Gao, Yingchun Ye, Rujing Yao
- Abstract summary: We propose a novel entity recognition model, called MDER, which is able to effectively extract the method and dataset entities from scientific papers.
We evaluate the proposed model on datasets constructed from the published papers of four research areas in computer science, i.e., NLP, CV, Data Mining and AI.
- Score: 21.93889297841459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Literature analysis helps researchers acquire a good understanding
of the development of science and technology. Traditional literature
analysis focuses largely on literature metadata such as topics, authors,
abstracts, keywords, and references, and pays little attention to the
main content of papers. In many scientific domains, such as science, computing,
and engineering, the methods and datasets involved in the published papers
carry important information and are quite useful for
domain analysis as well as algorithm and dataset recommendation. In this paper,
we propose a novel entity recognition model, called MDER, which is able to
effectively extract the method and dataset entities from the main textual
content of scientific papers. The model utilizes rule embedding and adopts a
parallel structure of CNN and Bi-LSTM with the self-attention mechanism. We
evaluate the proposed model on datasets which are constructed from the
published papers of four research areas in computer science, i.e., NLP, CV,
Data Mining and AI. The experimental results demonstrate that our model
performs well in all four areas and has a good capacity
for cross-area learning and recognition. We also conduct experiments to
evaluate the effectiveness of the different building modules within our model,
which indicate the importance of these modules in collectively
contributing to the good overall entity recognition performance. The data
augmentation experiments demonstrate that data augmentation
contributes positively to model training, making our model much more robust in
scenarios where only a small number of training samples are
available. Finally, we apply our model to PAKDD papers published from 2009 to 2019
to mine insightful results from scientific papers published over a longer time
span.
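The abstract describes a parallel CNN and Bi-LSTM structure followed by self-attention, but gives no implementation details. The following is only a minimal sketch of the general idea, not the authors' MDER code: it assumes the two branches each emit one feature vector per token, that the branches are fused by concatenation (an assumption), and that self-attention is the scaled dot-product form with identity projections (also an assumption). The names `fuse_parallel` and `self_attention` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_parallel(cnn_feats, lstm_feats):
    """Concatenate per-token features from two parallel branches.

    Assumption: fusion by concatenation; the abstract does not say
    how the CNN and Bi-LSTM outputs are combined.
    """
    return [c + l for c, l in zip(cnn_feats, lstm_feats)]


def self_attention(seq):
    """Scaled dot-product self-attention with Q = K = V = seq.

    Assumption: no learned projections, to keep the sketch small.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # Each output vector is a weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


# Toy per-token features standing in for the two branch outputs.
cnn = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lstm = [[0.5], [0.5], [0.5]]
fused = fuse_parallel(cnn, lstm)
attended = self_attention(fused)
```

In a real tagger the attended vectors would feed a per-token classifier that labels method and dataset spans; that step is omitted here.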
Related papers
- MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension [59.41495657570397]
We collected a multimodal, multidisciplinary dataset from open-access scientific articles published in Nature Communications journals.
This dataset spans 72 scientific disciplines, ensuring both diversity and quality.
We created benchmarks with various tasks and settings to comprehensively evaluate LMMs' capabilities in understanding scientific figures and content.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z)
- A Survey of Decomposition-Based Evolutionary Multi-Objective Optimization: Part II -- A Data Science Perspective [4.322038460697958]
We build a knowledge graph that encapsulates more than 5,400 papers, 10,000 authors, 400 venues, and 1,600 institutions for MOEA/D research.
We also explore the collaboration and citation networks of MOEA/D, uncovering hidden patterns in the growth of literature.
arXiv Detail & Related papers (2024-04-22T14:38:58Z)
- A Survey on Data Selection for Language Models [151.6210632830082]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- A Reliable Knowledge Processing Framework for Combustion Science using Foundation Models [0.0]
The study introduces an approach to process diverse combustion research data, spanning experimental studies, simulations, and literature.
The developed approach minimizes computational and economic expenses while optimizing data privacy and accuracy.
The framework consistently delivers accurate domain-specific responses with minimal human oversight.
arXiv Detail & Related papers (2023-12-31T17:15:25Z)
- Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert handpicking documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
arXiv Detail & Related papers (2023-09-19T17:18:36Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Enhancing Identification of Structure Function of Academic Articles Using Contextual Information [6.28532577139029]
This paper takes articles of the ACL conference as the corpus to identify the structure function of academic articles.
We employ traditional machine learning models and deep learning models to construct classifiers based on various feature inputs.
Inspired by (2), this paper introduces contextual information into the deep learning models and achieves significant results.
arXiv Detail & Related papers (2021-11-28T11:21:21Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.