OpenDORS: A dataset of openly referenced open research software
- URL: http://arxiv.org/abs/2512.01570v1
- Date: Mon, 01 Dec 2025 11:45:50 GMT
- Title: OpenDORS: A dataset of openly referenced open research software
- Authors: Stephan Druskat, Lars Grunske
- Abstract summary: We present a dataset of 134,352 unique open research software projects and 134,154 source code repositories referenced in open access literature. Each dataset record identifies the referencing publication and lists source code repositories of the software project. For 122,425 source code repositories, the dataset provides metadata on latest versions, license information, programming languages and descriptive metadata files.
- Score: 1.0026496861838448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many academic disciplines, software is created during the research process or for a research purpose. The crucial role of software for research is increasingly acknowledged. The application of software engineering to research software has been formalized as research software engineering, to create better software that enables better research. Despite this, large-scale studies of research software and its development are still lacking. To enable such studies, we present a dataset of 134,352 unique open research software projects and 134,154 source code repositories referenced in open access literature. Each dataset record identifies the referencing publication and lists source code repositories of the software project. For 122,425 source code repositories, the dataset provides metadata on latest versions, license information, programming languages and descriptive metadata files. We summarize the distributions of these features in the dataset and describe additional software metadata that extends the dataset in future work. Finally, we suggest examples of research that could use the dataset to develop a better understanding of research software practice in RSE research.
Related papers
- Making Software FAIR: A machine-assisted workflow for the research software lifecycle [2.682583873311538]
SoFAIR will extend the capabilities of widely used open scholarly infrastructures. It will deliver and deploy an effective solution for the management of the research software lifecycle.
arXiv Detail & Related papers (2025-01-08T14:17:26Z)
- On the Creation of Representative Samples of Software Repositories [1.8599311233727087]
With the emergence of social coding platforms such as GitHub, researchers now have access to millions of software repositories to use as source data for their studies.
Current sampling methods are often based on random selection or rely on variables which may not be related to the research study.
We present a methodology for creating representative samples of software repositories, where such representativeness is properly aligned with both the characteristics of the population of repositories and the requirements of the empirical study.
arXiv Detail & Related papers (2024-10-01T12:41:15Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- SciCat: A Curated Dataset of Scientific Software Repositories [4.77982299447395]
We introduce the SciCat dataset -- a comprehensive collection of Free-Libre Open Source Software (FLOSS) projects.
Our approach involves selecting projects from a pool of 131 million deforked repositories from the World of Code data source.
Our classification focuses on software designed for scientific purposes, research-related projects, and research support software.
arXiv Detail & Related papers (2023-12-11T13:46:33Z)
- The Software Heritage Open Science Ecosystem [0.0]
Software Heritage is the largest public archive of software source code and associated development history.
It has archived more than 16 billion unique source code files coming from more than 250 million collaborative development projects.
It supports empirical research on software by materializing the development history of public code in a single Merkle directed acyclic graph.
It ensures availability and guarantees integrity of the source code of software artifacts used in any field that relies on software to conduct experiments.
arXiv Detail & Related papers (2023-10-16T11:32:03Z)
- A Metadata-Based Ecosystem to Improve the FAIRness of Research Software [0.3185506103768896]
The reuse of research software is central to research efficiency and academic exchange.
The DataDesc ecosystem is presented, an approach to describing data models of software interfaces with detailed and machine-actionable metadata.
arXiv Detail & Related papers (2023-06-18T19:01:08Z)
- DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description.
To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set.
This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z)
- Deep learning for table detection and structure recognition: A survey [49.09628624903334]
The goal of this survey is to provide a profound comprehension of the major developments in the field of Table Detection.
We provide an analysis of both classic and new applications in the field.
The datasets and source code of the existing models are organized to provide the reader with a compass on this vast literature.
arXiv Detail & Related papers (2022-11-15T19:42:27Z)
- Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)
- DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data.
DataLab has features for dataset recommendation and global vision analysis.
So far, DataLab covers 1,715 datasets and 3,583 transformed versions of them.
arXiv Detail & Related papers (2022-02-25T18:32:19Z)
- Nine Best Practices for Research Software Registries and Repositories: A Concise Guide [63.52960372153386]
We present a set of nine best practices that can help managers define the scope, practices, and rules that govern individual registries and repositories.
These best practices were distilled from the experiences of the creators of existing resources, convened by a Task Force of the FORCE11 Software Citation Implementation Working Group during the years 2019 and 2020.
arXiv Detail & Related papers (2020-12-24T05:37:54Z)