Related papers: A dataset of mentorship in science with semantic and demographic estimations

URL: http://arxiv.org/abs/2106.06487v1
Date: Fri, 11 Jun 2021 16:12:15 GMT
Title: A dataset of mentorship in science with semantic and demographic estimations
Authors: Qing Ke, Lizhen Liang, Ying Ding, Stephen V. David, Daniel E. Acuna
Abstract summary: We describe a crowdsourced dataset of 743176 mentorship relationships among 738989 scientists across 112 fields. We enrich the scientists' profiles with publication data from the Microsoft Academic Graph and "semantic" representations of research using deep learning content analysis. We perform extensive validations of the profile-publication matching, semantic content, and demographic inferences.
Score: 4.317131795436002
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mentorship in science is crucial for topic choice, career decisions, and the success of mentees and mentors. Typically, researchers who study mentorship use article co-authorship and doctoral dissertation datasets. However, available datasets of this type focus on narrow selections of fields and miss out on early career and non-publication-related interactions. Here, we describe MENTORSHIP, a crowdsourced dataset of 743176 mentorship relationships among 738989 scientists across 112 fields that avoids these shortcomings. We enrich the scientists' profiles with publication data from the Microsoft Academic Graph and "semantic" representations of research using deep learning content analysis. Because gender and race have become critical dimensions when analyzing mentorship and disparities in science, we also provide estimations of these factors. We perform extensive validations of the profile-publication matching, semantic content, and demographic inferences. We anticipate this dataset will spur the study of mentorship in science and deepen our understanding of its role in scientists' career outcomes.

Related papers

AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy [59.32718342798908]
We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. We present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants.
arXiv Detail & Related papers (2025-05-26T21:49:18Z)
SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles. Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks. SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
A Diachronic Analysis of Paradigm Shifts in NLP Research: When, How, and Why? [84.46288849132634]
We propose a systematic framework for analyzing the evolution of research topics in a scientific field using causal discovery and inference techniques. We define three variables to encompass diverse facets of the evolution of research topics within NLP. We utilize a causal discovery algorithm to unveil the causal connections among these variables using observational data.
arXiv Detail & Related papers (2023-05-22T11:08:00Z)
Assessing Scientific Contributions in Data Sharing Spaces [64.16762375635842]
This paper introduces the SCIENCE-index, a blockchain-based metric measuring a researcher's scientific contributions. To incentivize researchers to share their data, the SCIENCE-index is augmented to include a data-sharing parameter. Our model is evaluated by comparing the distribution of its output for geographically diverse researchers to that of the h-index.
arXiv Detail & Related papers (2023-03-18T19:17:47Z)
How Data Scientists Review the Scholarly Literature [4.406926847270567]
We examine the literature review practices of data scientists. Data science represents a field seeing an exponential rise in papers. No prior work has examined the specific practices and challenges faced by these scientists.
arXiv Detail & Related papers (2023-01-10T03:53:05Z)
SciTweets -- A Dataset and Annotation Framework for Detecting Scientific Online Discourse [2.3371548697609303]
Scientific topics, claims and resources are increasingly debated as part of online discourse. This has led to both significant societal impact and increased interest in scientific online discourse from various disciplines. Research across disciplines currently suffers from a lack of robust definitions of the various forms of science-relatedness.
arXiv Detail & Related papers (2022-06-15T08:14:55Z)
Evaluating the state-of-the-art in mapping research spaces: a Brazilian case study [0.0]
Two recent works propose methods for creating research maps from scientists' publication records. We evaluate these models' ability to predict whether a given entity will enter a new field. We conduct a case study to showcase how these models can be used to characterize science dynamics in the context of Brazil.
arXiv Detail & Related papers (2021-04-07T18:14:41Z)
Early Indicators of Scientific Impact: Predicting Citations with Altmetrics [0.0]
We use altmetrics to predict the short-term and long-term citations that a scholarly publication could receive. We build various classification and regression models and evaluate their performance, finding neural networks and ensemble models to perform best for these tasks.
arXiv Detail & Related papers (2020-12-25T16:25:07Z)
A Survey of Embedding Space Alignment Methods for Language and Knowledge Graphs [77.34726150561087]
We survey the current research landscape on word, sentence and knowledge graph embedding algorithms. We provide a classification of the relevant alignment techniques and discuss benchmark datasets used in this field of research.
arXiv Detail & Related papers (2020-10-26T16:08:13Z)
Biases in Data Science Lifecycle [0.0]
The aim of this study is to provide a practical guideline to data scientists and increase their awareness. In this work, we reviewed different sources of biases and grouped them under different stages of the data science lifecycle.
arXiv Detail & Related papers (2020-09-10T13:41:48Z)
REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets [64.76453161039973]
REVISE (REvealing VIsual biaSEs) is a tool that assists in the investigation of a visual dataset. It surfaces potential biases along three dimensions: (1) object-based, (2) person-based, and (3) geography-based.
arXiv Detail & Related papers (2020-04-16T23:54:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.