Automated Extraction and Maturity Analysis of Open Source Clinical Informatics Repositories from Scientific Literature
- URL: http://arxiv.org/abs/2403.14721v2
- Date: Wed, 27 Mar 2024 17:36:09 GMT
- Title: Automated Extraction and Maturity Analysis of Open Source Clinical Informatics Repositories from Scientific Literature
- Authors: Jeremy R. Harper,
- Abstract summary: This study introduces an automated methodology to bridge the gap by systematically extracting GitHub repository URLs from academic papers indexed in arXiv.
Our approach encompasses querying the arXiv API for relevant papers, cleaning extracted GitHub URLs, fetching comprehensive repository information via the GitHub API, and analyzing repository maturity based on defined metrics such as stars, forks, open issues, and contributors.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the evolving landscape of clinical informatics, the integration and utilization of software tools developed through governmental funding represent a pivotal advancement in research and application. However, the dispersion of these tools across various repositories, with no centralized knowledge base, poses significant challenges to leveraging their full potential. This study introduces an automated methodology to bridge this gap by systematically extracting GitHub repository URLs from academic papers indexed in arXiv, focusing on the field of clinical informatics. Our approach encompasses querying the arXiv API for relevant papers, cleaning extracted GitHub URLs, fetching comprehensive repository information via the GitHub API, and analyzing repository maturity based on defined metrics such as stars, forks, open issues, and contributors. The process is designed to be robust, incorporating error handling and rate limiting to ensure compliance with API constraints. Preliminary findings demonstrate the efficacy of this methodology in compiling a centralized knowledge base of NIH-funded software tools, laying the groundwork for an enriched understanding and utilization of these resources within the clinical informatics community. We propose the future integration of Large Language Models (LLMs) to generate concise summaries and evaluations of the tools. This approach facilitates the discovery and assessment of clinical informatics tools and also enables ongoing monitoring of new and actively updated repositories, revolutionizing how researchers access and leverage federally funded software. The implications of this study extend beyond simplification of access to valuable resources; it proposes a scalable model for the dynamic aggregation and evaluation of scientific software, encouraging more collaborative, transparent, and efficient research practices in clinical informatics and beyond.
Related papers
- Enhancing Scientific Reproducibility Through Automated BioCompute Object Creation Using Retrieval-Augmented Generation from Publications [0.0]
The IEEE Biocompute Object (BCO) standard addresses the need but faces adoption challenges due to the overhead of creating compliant documentation.
This paper presents a novel approach to automate the creation of BCOs from scientific papers using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs)
The implementation incorporates optimized retrieval processes, including a two-pass retrieval with re-ranking, and employs carefully engineered prompts for each BCO domain.
arXiv Detail & Related papers (2024-09-23T14:51:22Z) - The Future of Scientific Publishing: Automated Article Generation [0.0]
This study introduces a novel software tool leveraging large language model (LLM) prompts, designed to automate the generation of academic articles from Python code.
Python served as a foundational proof of concept; however, the underlying methodology and framework exhibit adaptability across various GitHub repo's.
The development was achieved without reliance on advanced language model agents, ensuring high fidelity in the automated generation of coherent and comprehensive academic content.
arXiv Detail & Related papers (2024-04-11T16:47:02Z) - Agent-based Learning of Materials Datasets from Scientific Literature [0.0]
We develop a chemist AI agent, powered by large language models (LLMs), to create structured datasets from natural language text.
Our chemist AI agent, Eunomia, can plan and execute actions by leveraging the existing knowledge from decades of scientific research articles.
arXiv Detail & Related papers (2023-12-18T20:29:58Z) - A Metadata-Based Ecosystem to Improve the FAIRness of Research Software [0.3185506103768896]
The reuse of research software is central to research efficiency and academic exchange.
The DataDesc ecosystem is presented, an approach to describing data models of software interfaces with detailed and machine-actionable metadata.
arXiv Detail & Related papers (2023-06-18T19:01:08Z) - GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training
Data Exploration [97.68234051078997]
We discuss how Pyserini can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts.
We include a Jupyter Notebook-based walk through the core interoperability features, available on GitHub.
We present GAIA Search - a search engine built following previously laid out principles, giving access to four popular large-scale text collections.
arXiv Detail & Related papers (2023-06-02T12:09:59Z) - TemporAI: Facilitating Machine Learning Innovation in Time Domain Tasks
for Medicine [91.3755431537592]
TemporAI is an open source Python software library for machine learning (ML) tasks involving data with a time component.
It supports data in time series, static, and eventmodalities and provides an interface for prediction, causal inference, and time-to-event analysis.
arXiv Detail & Related papers (2023-01-28T17:57:53Z) - Deep learning for table detection and structure recognition: A survey [49.09628624903334]
The goal of this survey is to provide a profound comprehension of the major developments in the field of Table Detection.
We provide an analysis of both classic and new applications in the field.
The datasets and source code of the existing models are organized to provide the reader with a compass on this vast literature.
arXiv Detail & Related papers (2022-11-15T19:42:27Z) - Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z) - A Meta-embedding-based Ensemble Approach for ICD Coding Prediction [64.42386426730695]
International Classification of Diseases (ICD) are the de facto codes used globally for clinical coding.
These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information.
Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles.
arXiv Detail & Related papers (2021-02-26T17:49:58Z) - Therapeutics Data Commons: Machine Learning Datasets and Tasks for
Therapeutics [84.94299203422658]
Therapeutics Data Commons is a framework to systematically access and evaluate machine learning across the entire range of therapeutics.
At its core, TDC is a collection of curated datasets and learning tasks that can translate algorithmic innovation into biomedical and clinical implementation.
TDC also provides an ecosystem of tools, libraries, leaderboards, and community resources, including data functions, strategies for systematic model evaluation, meaningful data splits, data processors, and molecule generation oracles.
arXiv Detail & Related papers (2021-02-18T18:50:31Z) - Open Source Software for Efficient and Transparent Reviews [0.11179881480027788]
ASReview is an open source machine learning-aided pipeline applying active learning.
We demonstrate by means of simulation studies that ASReview can yield far more efficient reviewing than manual reviewing.
arXiv Detail & Related papers (2020-06-22T11:57:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.