Related papers: Standards in the Preparation of Biomedical Research Metadata: A Bridge2AI Perspective

URL: http://arxiv.org/abs/2509.10432v2
Date: Tue, 16 Sep 2025 20:37:41 GMT
Title: Standards in the Preparation of Biomedical Research Metadata: A Bridge2AI Perspective
Authors: Harry Caufield, Satrajit Ghosh, Sek Wong Kong, Jillian Parker, Nathan Sheffield, Bhavesh Patel, Andrew Williams, Timothy Clark, Monica C. Munoz-Torres
Abstract summary: Bridge2AI has defined the criteria a dataset may possess to render it AI-ready. These criteria include FAIRness, provenance, degree of characterization, explainability, sustainability, and computability. This report assesses the state of metadata creation and standardization in the Bridge2AI Grand Challenges.
Score: 1.0389904886733017
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI-readiness describes the degree to which data may be optimally and ethically used for subsequent AI and Machine Learning (AI/ML) methods, where those methods may involve some combination of model training, data classification, and ethical, explainable prediction. The Bridge2AI consortium has defined the particular criteria a biomedical dataset may possess to render it AI-ready: in brief, a dataset's readiness is related to its FAIRness, provenance, degree of characterization, explainability, sustainability, and computability, in addition to its accompaniment with documentation about ethical data practices. To ensure AI-readiness and to clarify data structure and relationships within Bridge2AI's Grand Challenges (GCs), particular types of metadata are necessary. The GCs within the Bridge2AI initiative include four data-generating projects focusing on generating AI/ML-ready datasets to tackle complex biomedical and behavioral research problems. These projects develop standardized, multimodal data, tools, and training resources to support AI integration, while addressing ethical data practices. Examples include using voice as a biomarker, building interpretable genomic tools, modeling disease trajectories with diverse multimodal data, and mapping cellular and molecular health indicators across the human body. This report assesses the state of metadata creation and standardization in the Bridge2AI GCs, provides guidelines where required, and identifies gaps and areas for improvement across the program. New projects, including those outside the Bridge2AI consortium, would benefit from what we have learned about creating metadata as part of efforts to promote AI readiness.

Related papers

Securing Dual-Use Pathogen Data of Concern [4.518583284698333]
Training data is an essential input into creating competent artificial intelligence (AI) models. Data controls may be among the most high-leverage interventions available to reduce the proliferation of concerning biological AI capabilities.
arXiv Detail & Related papers (2026-02-08T17:11:19Z)
OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive [50.468138755368805]
The opioid crisis represents a significant moment in public health. Data and documents disclosed in the UCSF-JHU Opioid Industry Documents Archive (OIDA). In this paper, we tackle this challenge by organizing the original dataset according to document attributes.
arXiv Detail & Related papers (2025-11-13T03:27:32Z)
PyTDC: A multimodal machine learning training, evaluation, and inference platform for biomedical foundation models [59.17570021208177]
PyTDC is a machine-learning platform providing streamlined training, evaluation, and inference software for multimodal biological AI models. This paper discusses the components of PyTDC's architecture and, to our knowledge, the first-of-its-kind case study on the introduced single-cell drug-target nomination ML task.
arXiv Detail & Related papers (2025-05-08T18:15:38Z)
A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI [70.06771291117965]
We introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset. Biomedica contains over 6 million scientific articles and 24 million image-text pairs. We provide scalable streaming and search APIs through a web server, facilitating seamless integration with AI systems.
arXiv Detail & Related papers (2025-03-26T05:56:46Z)
Biomedical Foundation Model: A Survey [84.26268124754792]
Foundation models are large-scale pre-trained models that learn from extensive unlabeled datasets. These models can be adapted to various applications such as question answering and visual understanding. This survey explores the potential of foundation models across diverse domains within biomedical fields.
arXiv Detail & Related papers (2025-03-03T22:42:00Z)
Datasheets for Healthcare AI: A Framework for Transparency and Bias Mitigation [0.0]
Bias, data incompleteness, and inaccuracies in training datasets can lead to unfair outcomes and amplify existing disparities. We propose a dataset documentation framework that promotes transparency and ensures alignment with regulatory requirements. The findings emphasise the importance of dataset documentation in fostering responsible AI development.
arXiv Detail & Related papers (2025-01-09T23:36:34Z)
DREAMS: A python framework for Training Deep Learning Models on EEG Data with Model Card Reporting for Medical Applications [7.2934799091933815]
We introduce DREAMS, a Python-based framework designed to generate automated model cards for deep learning models applied to EEG data. Unlike generic model reporting tools, DREAMS is specifically tailored for EEG-based deep learning applications. The framework seamlessly integrates with deep learning pipelines, providing structured YAML-based documentation.
arXiv Detail & Related papers (2024-09-26T13:12:13Z)
DeepDiveAI: Identifying AI Related Documents in Large Scale Literature Data [4.870043547158868]
We propose a method to automatically classify AI-related documents from large-scale literature databases. The dataset construction approach integrates expert knowledge with the capabilities of advanced models.
arXiv Detail & Related papers (2024-08-23T07:05:12Z)
Building Flexible, Scalable, and Machine Learning-ready Multimodal Oncology Datasets [17.774341783844026]
This work proposes the Multimodal Integration of Oncology Data System (MINDS). MINDS is a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources. By harmonizing multimodal data, MINDS aims to give researchers greater analytical ability.
arXiv Detail & Related papers (2023-09-30T15:44:39Z)
PathLDM: Text conditioned Latent Diffusion Model for Histopathology [62.970593674481414]
We introduce PathLDM, the first text-conditioned Latent Diffusion Model tailored for generating high-quality histopathology images. Our approach fuses image and textual data to enhance the generation process. We achieved a SoTA FID score of 7.64 for text-to-image generation on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor with FID 30.1.
arXiv Detail & Related papers (2023-09-01T22:08:32Z)
2021 BEETL Competition: Advancing Transfer Learning for Subject Independence & Heterogenous EEG Data Sets [89.84774119537087]
We design two transfer learning challenges around diagnostics and Brain-Computer Interfacing (BCI). Task 1 is centred on medical diagnostics, addressing automatic sleep stage annotation across subjects. Task 2 is centred on BCI, addressing motor imagery decoding across both subjects and data sets.
arXiv Detail & Related papers (2022-02-14T12:12:20Z)
Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals [4.799783526620609]
We released a catalogue of datasets and benchmarks pertaining to the broad domain of clinical and biomedical natural language processing (NLP). A total of 450 NLP datasets were manually systematized and annotated with rich metadata. Our analysis indicates that AI benchmarks of direct clinical relevance are scarce and fail to cover most work activities that clinicians want to see addressed.
arXiv Detail & Related papers (2022-01-18T15:05:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.