Librarian-in-the-Loop: A Natural Language Processing Paradigm for
Detecting Informal Mentions of Research Data in Academic Literature
- URL: http://arxiv.org/abs/2203.05112v1
- Date: Thu, 10 Mar 2022 02:11:30 GMT
- Title: Librarian-in-the-Loop: A Natural Language Processing Paradigm for
Detecting Informal Mentions of Research Data in Academic Literature
- Authors: Lizhou Fan, Sara Lafia, David Bleckley, Elizabeth Moss, Andrea Thomer,
Libby Hemphill
- Abstract summary: We propose a natural language processing paradigm to support the human task of identifying informal mentions of research datasets.
The work of discovering informal data mentions is currently performed by librarians and their staff in the Inter-university Consortium for Political and Social Research.
- Score: 1.4190701053683017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data citations provide a foundation for studying research data impact.
Collecting and managing data citations is a new frontier in archival science
and scholarly communication. However, the discovery and curation of research
data citations is labor-intensive. Data citations that reference unique
identifiers (i.e., DOIs) are readily findable; however, informal mentions of
research data are more challenging to detect. We propose a natural language
processing (NLP) paradigm to support the human task of identifying informal
mentions of research datasets. The work of discovering informal data
mentions is currently performed by librarians and their staff in the
Inter-university Consortium for Political and Social Research (ICPSR), a large
social science data archive that maintains an extensive bibliography of data-related
literature. The NLP model is bootstrapped from data citations actively
collected by librarians at ICPSR. The model combines pattern matching with
multiple iterations of human annotations to learn additional rules for
detecting informal data mentions. These examples are then used to train an NLP
pipeline. The librarian-in-the-loop paradigm is centered on the data work
performed by ICPSR librarians, supporting broader efforts to build a more
comprehensive bibliography of data-related literature that reflects the
scholarly communities of research data users.
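
To make the bootstrap step concrete, the following is a minimal sketch, in
Python, of pattern matching for informal dataset mentions. The seed names and
cue phrases below are hypothetical illustrations only; a real run would load
seed names from the librarian-collected ICPSR citations described above.

    import re

    # A minimal sketch of the pattern-matching bootstrap step described above.
    # The seed names are hypothetical examples of ICPSR-style study titles; a
    # real run would load them from librarian-curated data citations.
    SEED_DATASET_NAMES = [
        "General Social Survey",
        "Panel Study of Income Dynamics",
        "National Survey of Family Growth",
    ]

    # Informal-mention cues, e.g. "data from the General Social Survey".
    CUE = r"(?:data|dataset|survey|study|sample)s?\s+(?:from|of|in)\s+the\s+"

    def find_informal_mentions(text):
        """Return (start, end, name) spans where a cue precedes a seed name."""
        spans = []
        for name in SEED_DATASET_NAMES:
            pattern = re.compile(CUE + re.escape(name), re.IGNORECASE)
            for match in pattern.finditer(text):
                spans.append((match.start(), match.end(), name))
        return spans

    if __name__ == "__main__":
        sample = "We analyzed data from the General Social Survey (1994-2002)."
        print(find_informal_mentions(sample))

Matched spans like these could serve as weak labels: librarians review them,
and confirmed examples become training data for the NLP pipeline.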
Related papers
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries [0.0]
We evaluate OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS).
The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards.
arXiv Detail & Related papers (2024-03-29T22:59:34Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Natural Language Processing for Drug Discovery Knowledge Graphs:
promises and pitfalls [0.0]
Building and analysing knowledge graphs (KGs) to aid drug discovery is a topical area of research.
We discuss promises and pitfalls of using natural language processing (NLP) to mine unstructured text as a data source for KGs.
arXiv Detail & Related papers (2023-10-24T07:35:24Z) - NLPeer: A Unified Resource for the Computational Study of Peer Review [58.71736531356398]
We introduce NLPeer -- the first ethically sourced multidomain corpus of more than 5k papers and 11k review reports from five different venues.
We augment previous peer review datasets to include parsed and structured paper representations, rich metadata and versioning information.
Our work paves the path towards systematic, multi-faceted, evidence-based study of peer review in NLP and beyond.
arXiv Detail & Related papers (2022-11-12T12:29:38Z) - Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time, and research gaps in the data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z) - A Natural Language Processing Pipeline for Detecting Informal Data
References in Academic Literature [1.8692254863855962]
We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets.
The pipeline increases recall for literature to review for inclusion in data-related collections of publications.
We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with the datasets they reference (a toy sketch of such an NER setup follows this list).
arXiv Detail & Related papers (2022-05-23T22:06:46Z) - DeepShovel: An Online Collaborative Platform for Data Extraction in
Geoscience Literature with AI Assistance [48.55345030503826]
Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data.
DeepShovel is a publicly-available AI-assisted data extraction system to support their needs.
A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
arXiv Detail & Related papers (2022-02-21T12:18:08Z) - An Empirical Survey of Data Augmentation for Limited Data Learning in
NLP [88.65488361532158]
The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks.
Data augmentation methods have been explored as a means of improving data efficiency in NLP.
We provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting.
arXiv Detail & Related papers (2021-06-14T15:27:22Z)
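
As referenced above, informal data mentions can be framed as a Named Entity
Recognition task. Below is a minimal sketch, assuming spaCy 3.x, with a toy
DATASET label and two hand-made training examples; it illustrates the general
setup only, not the authors' actual model, labels, or training data.

    import spacy
    from spacy.training import Example

    # Toy annotated examples: (text, character spans of dataset mentions).
    # These are hypothetical stand-ins for librarian-labeled citations.
    TRAIN_DATA = [
        ("We used data from the General Social Survey.",
         {"entities": [(22, 43, "DATASET")]}),
        ("Estimates come from the Panel Study of Income Dynamics.",
         {"entities": [(24, 54, "DATASET")]}),
    ]

    nlp = spacy.blank("en")            # start from a blank English pipeline
    ner = nlp.add_pipe("ner")          # add an empty NER component
    ner.add_label("DATASET")

    optimizer = nlp.initialize()
    for _ in range(20):                # a few passes over the toy data
        losses = {}
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)

    doc = nlp("Results are based on the General Social Survey sample.")
    print([(ent.text, ent.label_) for ent in doc.ents])

In a realistic workflow, the matches confirmed during pattern matching would
supply the annotated spans, and librarian review of model predictions would
feed further training iterations.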
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.