Therapeutics Data Commons: Machine Learning Datasets and Tasks for
Therapeutics
- URL: http://arxiv.org/abs/2102.09548v1
- Date: Thu, 18 Feb 2021 18:50:31 GMT
- Title: Therapeutics Data Commons: Machine Learning Datasets and Tasks for
Therapeutics
- Authors: Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure
Leskovec, Connor W. Coley, Cao Xiao, Jimeng Sun, Marinka Zitnik
- Abstract summary: Therapeutics Data Commons is a framework to systematically access and evaluate machine learning across the entire range of therapeutics.
At its core, TDC is a collection of curated datasets and learning tasks that can translate algorithmic innovation into biomedical and clinical implementation.
TDC also provides an ecosystem of tools, libraries, leaderboards, and community resources, including data functions, strategies for systematic model evaluation, meaningful data splits, data processors, and molecule generation oracles.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning for therapeutics is an emerging field with incredible
opportunities for innovation and expansion. Despite the initial success, many
key challenges remain open. Here, we introduce Therapeutics Data Commons (TDC),
the first unifying framework to systematically access and evaluate machine
learning across the entire range of therapeutics. At its core, TDC is a
collection of curated datasets and learning tasks that can translate
algorithmic innovation into biomedical and clinical implementation. To date,
TDC includes 66 machine learning-ready datasets from 22 learning tasks,
spanning the discovery and development of safe and effective medicines. TDC
also provides an ecosystem of tools, libraries, leaderboards, and community
resources, including data functions, strategies for systematic model
evaluation, meaningful data splits, data processors, and molecule generation
oracles. All datasets and learning tasks are integrated and accessible via an
open-source library. We envision that TDC can facilitate algorithmic and
scientific advances and accelerate development, validation, and transition into
production and clinical implementation. TDC is a continuous, open-source
initiative, and we invite contributions from the research community. TDC is
publicly available at https://tdcommons.ai.
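The "meaningful data splits" mentioned above refer to splits that keep structurally related molecules in the same partition (e.g., scaffold splits), so that held-out sets probe generalization to unseen chemotypes rather than near-duplicates; in the released library this is exposed through dataset loaders with a `get_split` method. As a rough, library-free illustration of the idea (the function, field names, and toy data below are hypothetical, not TDC's actual API), a group-aware split might look like:

```python
from collections import defaultdict

def group_split(records, key, frac_train=0.8):
    """Split records so that all items sharing a group key land in the
    same partition -- a simplified stand-in for a scaffold split.

    records: list of dicts; key: field naming the group (e.g. a scaffold id).
    """
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    # Assign whole groups, largest first, until the train fraction is filled.
    train, test = [], []
    target = frac_train * len(records)
    for _, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        (train if len(train) < target else test).extend(members)
    return train, test

# Toy usage: molecules tagged with a (hypothetical) scaffold id.
mols = [
    {"smiles": "CCO", "scaffold": "A"},
    {"smiles": "CCN", "scaffold": "A"},
    {"smiles": "c1ccccc1", "scaffold": "B"},
    {"smiles": "c1ccccc1O", "scaffold": "B"},
    {"smiles": "CC(=O)O", "scaffold": "C"},
]
train, test = group_split(mols, key="scaffold", frac_train=0.6)
```

The key property, which the sketch preserves, is that no group (scaffold) ever straddles the train/test boundary, which is what makes such a split "meaningful" for estimating out-of-distribution performance.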
Related papers
- Automated Extraction and Maturity Analysis of Open Source Clinical Informatics Repositories from Scientific Literature [0.0]
This study introduces an automated methodology to bridge this gap by systematically extracting GitHub repository URLs from academic papers indexed in arXiv.
The approach encompasses querying the arXiv API for relevant papers, cleaning extracted GitHub URLs, fetching comprehensive repository information via the GitHub API, and analyzing repository maturity based on metrics such as stars, forks, open issues, and contributors.
arXiv Detail & Related papers (2024-03-20T17:06:51Z)
- Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models [48.07083163501746]
Clinical natural language processing requires methods that can address domain-specific challenges.
We propose ClinGen, an innovative, resource-efficient approach that infuses clinical knowledge into the data-generation process.
An empirical study across 7 clinical NLP tasks and 16 datasets reveals that ClinGen consistently enhances performance across various tasks.
arXiv Detail & Related papers (2023-11-01T04:37:28Z)
- Building Flexible, Scalable, and Machine Learning-ready Multimodal Oncology Datasets [17.774341783844026]
This work proposes the Multimodal Integration of Oncology Data System (MINDS), a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate oncology data from public sources.
By harmonizing multimodal data, MINDS aims to empower researchers with greater analytical ability.
arXiv Detail & Related papers (2023-09-30T15:44:39Z)
- Efficient Large Scale Medical Image Dataset Preparation for Machine Learning Applications [0.08484806297945031]
This paper introduces a data curation tool, developed as part of the Kaapana open-source toolkit, tailored to the needs of radiologists and machine learning researchers.
It incorporates advanced search, auto-annotation, and efficient tagging functionalities for improved data curation.
arXiv Detail & Related papers (2023-09-29T14:41:02Z)
- Advancing Italian Biomedical Information Extraction with Transformers-based Models: Methodological Insights and Multicenter Practical Application [0.27027468002793437]
Information extraction can help clinical practitioners overcome this limitation through automated text-mining pipelines.
We created the first Italian neuropsychiatric Named Entity Recognition dataset, PsyNIT, and used it to develop a Transformers-based model.
The lessons learned are: (i) the crucial role of a consistent annotation process and (ii) a fine-tuning strategy that combines classical methods with a "low-resource" approach.
arXiv Detail & Related papers (2023-06-08T16:15:46Z)
- Incomplete Multimodal Learning for Complex Brain Disorders Prediction [65.95783479249745]
We propose a new incomplete multimodal data integration approach that employs transformers and generative adversarial networks.
We apply the method to predict cognitive degeneration and disease outcomes using multimodal imaging-genetic data from the Alzheimer's Disease Neuroimaging Initiative cohort.
arXiv Detail & Related papers (2023-05-25T16:29:16Z)
- Deep Anatomical Federated Network (Dafne): an open client/server framework for the continuous collaborative improvement of deep-learning-based medical image segmentation [0.0]
Dafne is the first decentralized, collaborative solution that implements continuously evolving deep learning models exploiting the collective knowledge of the users of the system.
The models deployed through Dafne improve their performance over time and generalize to data types not seen in the training sets.
arXiv Detail & Related papers (2023-02-13T13:35:09Z)
- Dissecting Self-Supervised Learning Methods for Surgical Computer Vision [51.370873913181605]
Self-Supervised Learning (SSL) methods have begun to gain traction in the general computer vision community, but their effectiveness in more complex and impactful domains, such as medicine and surgery, remains limited and underexplored.
We present an extensive analysis of the performance of these methods on the Cholec80 dataset for two fundamental tasks in surgical context understanding: phase recognition and tool presence detection.
arXiv Detail & Related papers (2022-07-01T14:17:11Z)
- Federated Cycling (FedCy): Semi-supervised Federated Learning of Surgical Phases [57.90226879210227]
FedCy is a federated semi-supervised learning (FSSL) method that combines federated learning and self-supervised learning to exploit a decentralized dataset of both labeled and unlabeled videos.
We demonstrate significant performance gains over state-of-the-art FSSL methods on the task of automatic recognition of surgical phases.
arXiv Detail & Related papers (2022-03-14T17:44:53Z)
- Surgical Data Science -- from Concepts toward Clinical Translation [67.543698133416]
Surgical Data Science aims to improve the quality of interventional healthcare through the capture, organization, analysis, and modeling of data.
We shed light on the underlying reasons and provide a roadmap for future advances in the field.
arXiv Detail & Related papers (2020-10-30T14:20:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site makes no guarantee of the quality of the listed information and accepts no responsibility for any consequences of its use.