A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature
- URL: http://arxiv.org/abs/2205.11651v1
- Date: Mon, 23 May 2022 22:06:46 GMT
- Title: A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature
- Authors: Sara Lafia, Lizhou Fan, Libby Hemphill
- Abstract summary: We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets.
The pipeline increases recall for literature to review for inclusion in data-related collections of publications.
We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with datasets they reference.
- Score: 1.8692254863855962
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Discovering authoritative links between publications and the datasets that
they use can be a labor-intensive process. We introduce a natural language
processing pipeline that retrieves and reviews publications for informal
references to research datasets, which complements the work of data librarians.
We first describe the components of the pipeline and then apply it to expand an
authoritative bibliography linking thousands of social science studies to the
data-related publications in which they are used. The pipeline increases recall
for literature to review for inclusion in data-related collections of
publications and makes it possible to detect informal data references at scale.
We contribute (1) a novel Named Entity Recognition (NER) model that reliably
detects informal data references and (2) a dataset connecting items from social
science literature with datasets they reference. Together, these contributions
enable future work on data reference, data citation networks, and data reuse.
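
To make the idea of an informal data reference concrete, here is a minimal sketch of a candidate detector. It is a rule-based stand-in, not the authors' NER model: the cue phrases, the dataset-noun list, and the function names `find_candidates` and `needs_review` are all assumptions made for illustration.

```python
import re

# Assumed cue phrases that often introduce a dataset mention in running text.
CUES = r"\b(?:data from|drawn from|based on|using)"
# Assumed shape of a dataset name: capitalized words ending in a dataset-like noun.
NAME = r"((?:[A-Z][\w-]*\s+)+(?:Survey|Study|Census|Panel|Poll))\b"
PATTERN = re.compile(CUES + r"\s+(?:the\s+)?" + NAME)


def find_candidates(text):
    """Return candidate dataset names that follow a cue phrase."""
    return [match.group(1) for match in PATTERN.finditer(text)]


def needs_review(text):
    """Flag a passage for human review if it contains any candidate mention."""
    return bool(find_candidates(text))


sample = (
    "We analyze data from the General Social Survey and, "
    "using the American National Election Study, compare turnout."
)
print(find_candidates(sample))
```

A hand-written pattern like this misses any name outside its noun list, which is one reason a trained NER model, with a librarian reviewing the flagged passages, may be a better fit at scale.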
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
(2024-10-28T15:56:49Z)
- [Citation needed] Data usage and citation practices in medical imaging conferences [1.9702506447163306]
We present two open-source tools that could help with the detection of dataset usage.
We studied the usage of 20 publicly available medical datasets in papers from MICCAI and MIDL.
Our findings demonstrate the concentration of the usage of a limited set of datasets.
(2024-02-05T13:41:22Z)
- Natural Language Processing for Drug Discovery Knowledge Graphs: promises and pitfalls [0.0]
Building and analysing knowledge graphs (KGs) to aid drug discovery is a topical area of research.
We discuss promises and pitfalls of using natural language processing (NLP) to mine unstructured text as a data source for KGs.
(2023-10-24T07:35:24Z)
- SciLit: A Platform for Joint Scientific Literature Discovery, Summarization and Citation Generation [11.186252009101077]
We propose SciLit, a pipeline that automatically recommends relevant papers, extracts highlights, and suggests a reference sentence as a citation of a paper.
SciLit efficiently recommends papers from large databases of hundreds of millions of papers using a two-stage pre-fetching and re-ranking literature search system.
(2023-06-06T09:34:45Z)
- Inline Citation Classification using Peripheral Context and Time-evolving Augmentation [23.88211560188731]
We propose a new dataset, named 3Cext, which provides discourse information using the cited sentences.
We propose PeriCite, a Transformer-based deep neural network that fuses peripheral sentences and domain knowledge.
(2023-03-01T09:11:07Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
(2023-01-24T17:13:08Z)
- CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
(2022-12-19T16:10:56Z)
- Librarian-in-the-Loop: A Natural Language Processing Paradigm for Detecting Informal Mentions of Research Data in Academic Literature [1.4190701053683017]
We propose a natural language processing paradigm to support the human task of identifying informal mentions made to research datasets.
The work of discovering informal data mentions is currently performed by librarians and their staff in the Inter-university Consortium for Political and Social Research.
(2022-03-10T02:11:30Z)
- DeepShovel: An Online Collaborative Platform for Data Extraction in Geoscience Literature with AI Assistance [48.55345030503826]
Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data.
DeepShovel is a publicly-available AI-assisted data extraction system to support their needs.
A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
(2022-02-21T12:18:08Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
(2021-09-20T10:06:46Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
(2021-06-03T03:00:12Z)