D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of
Computer Science Research
- URL: http://arxiv.org/abs/2204.13384v1
- Date: Thu, 28 Apr 2022 09:59:52 GMT
- Title: D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of
Computer Science Research
- Authors: Jan Philip Wahle and Terry Ruas and Saif M. Mohammad and Bela Gipp
- Abstract summary: DBLP is the largest open-access repository of scientific articles on computer science.
We retrieved more than 6 million publications from DBLP and extracted metadata.
D3 can be used to identify trends in research activity, productivity, focus, bias, accessibility, and impact of computer science research.
- Score: 27.882505456528243
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: DBLP is the largest open-access repository of scientific articles on computer
science and provides metadata associated with publications, authors, and
venues. We retrieved more than 6 million publications from DBLP and extracted
pertinent metadata (e.g., abstracts, author affiliations, citations) from the
publication texts to create the DBLP Discovery Dataset (D3). D3 can be used to
identify trends in research activity, productivity, focus, bias, accessibility,
and impact of computer science research. We present an initial analysis focused
on the volume of computer science research (e.g., number of papers, authors,
research activity), trends in topics of interest, and citation patterns. Our
findings show that computer science is a growing research field (approx. 15%
annually), with an active and collaborative researcher community. While papers
in recent years present more bibliographical entries in comparison to previous
decades, the average number of citations has been declining. Investigating
papers' abstracts reveals that recent topic trends are clearly reflected in D3.
Finally, we list further applications of D3 and pose supplemental research
questions. The D3 dataset, our findings, and source code are publicly available
for research purposes.
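The growth figure in the abstract (approx. 15% annually) is a compound rate, so it can be related to yearly publication counts with a little arithmetic. The following is a minimal sketch, not the authors' analysis code: the function names and the publication counts are hypothetical and synthetic, chosen only to illustrate the calculation, and are not taken from D3.

```python
import math


def annual_growth_rate(first_count: float, last_count: float, years: int) -> float:
    """Compound annual growth rate between two yearly publication counts."""
    if first_count <= 0 or last_count <= 0 or years <= 0:
        raise ValueError("counts and years must be positive")
    return (last_count / first_count) ** (1 / years) - 1


def doubling_time(rate: float) -> float:
    """Years needed for the yearly count to double at a constant growth rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return math.log(2) / math.log(1 + rate)


# Synthetic example (NOT D3 data): 1000 papers growing 15% per year for 10 years.
synthetic_start = 1000.0
synthetic_end = synthetic_start * 1.15 ** 10

rate = annual_growth_rate(synthetic_start, synthetic_end, years=10)
print(f"recovered annual growth rate: {rate:.2%}")
print(f"doubling time at 15% growth: {doubling_time(0.15):.2f} years")
```

Under these assumptions, a constant 15% annual growth rate implies that the yearly number of publications doubles roughly every five years.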
Related papers
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- A Bibliographic Study on Artificial Intelligence Research: Global Panorama and Indian Appearance [2.9895330439073406]
The study reveals that neural networks and deep learning are the major topics included in top AI research publications.
The study also investigates the relative position of Indian researchers in terms of AI research.
arXiv Detail & Related papers (2023-07-04T05:08:36Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- Analyzing the State of Computer Science Research with the DBLP Discovery Dataset [0.0]
We conduct a scientometric analysis to uncover the implicit patterns hidden in CS metadata.
We introduce the CS-Insights system, an interactive web application to analyze CS publications with various dashboards, filters, and visualizations.
Both D3 and CS-Insights are open-access, and CS-Insights can be easily adapted to other datasets in the future.
arXiv Detail & Related papers (2022-12-01T16:27:42Z)
- Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)
- DeepShovel: An Online Collaborative Platform for Data Extraction in Geoscience Literature with AI Assistance [48.55345030503826]
Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data.
DeepShovel is a publicly-available AI-assisted data extraction system to support their needs.
A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
arXiv Detail & Related papers (2022-02-21T12:18:08Z)
- Industry and Academic Research in Computer Vision [5.634825161148484]
This work aims to study the dynamic between research in the industry and academia in computer vision.
The results are demonstrated on a set of top-5 vision conferences that are representative of the field.
arXiv Detail & Related papers (2021-07-10T20:09:52Z)
- Studying the characteristics of scientific communities using individual-level bibliometrics: the case of Big Data research [2.208242292882514]
We study the academic age, production, and research focus of the community of authors active in Big Data research.
Results show that the academic realm of "Big Data" is a growing topic with an expanding community of authors.
arXiv Detail & Related papers (2021-06-10T08:17:09Z)
- A Survey of Knowledge Tracing: Models, Variants, and Applications [70.69281873057619]
Knowledge Tracing is one of the fundamental tasks for student behavioral data analysis.
We present three types of fundamental KT models with distinct technical routes.
We discuss potential directions for future research in this rapidly growing field.
arXiv Detail & Related papers (2021-05-06T13:05:55Z)
- Domain Generalization: A Survey [146.68420112164577]
Domain generalization (DG) aims to achieve OOD generalization by only using source domain data for model learning.
For the first time, a comprehensive literature review is provided to summarize the ten-year development in DG.
arXiv Detail & Related papers (2021-03-03T16:12:22Z)
- Two Huge Title and Keyword Generation Corpora of Research Articles [0.0]
We introduce two huge datasets for text summarization (OAGSX) and keyword generation (OAGKX) research.
The data were retrieved from the Open Academic Graph which is a network of research profiles and publications.
We would like to apply topic modeling on the two sets to derive subsets of research articles from more specific disciplines.
arXiv Detail & Related papers (2020-02-11T21:17:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.