Technical specification of a framework for the collection of clinical images and data
- URL: http://arxiv.org/abs/2508.03723v1
- Date: Tue, 29 Jul 2025 17:30:50 GMT
- Title: Technical specification of a framework for the collection of clinical images and data
- Authors: Alistair Mackenzie, Mark Halling-Brown, Ruben van Engen, Carlijn Roozemond, Lucy Warren, Dominic Ward, Nadia Smith,
- Abstract summary: A key characteristic of the main collection framework described here is that it can enable automated and ongoing collection of datasets. It is important that datasets have a mix of older cases with long-term follow-up. Other types of collection frameworks, which do not follow a fully automated approach, are also described.
- Score: 0.10051474951635875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report describes a framework for collecting clinical images and data for use in training and validating artificial intelligence (AI) tools. It covers not only the collection of the images and clinical data, but also the ethics and information governance processes needed to ensure the data is collected safely, and the infrastructure and agreements required to share data with other groups. A key characteristic of the main collection framework described here is that it can enable automated and ongoing collection of datasets, ensuring that the data is up to date and representative of current practice. This matters when training and validating AI tools because datasets need both older cases with long-term follow-up, so that the clinical outcome is as accurate as possible, and current data. Validations run on old data yield findings and conclusions that reflect the status of the imaging units when that data was generated. A validation dataset should be able to assess AI tools with the data they would see if deployed and active now. Other types of collection frameworks, which do not follow a fully automated approach, are also described. Whilst the fully automated method is recommended for large-scale, long-term image collection, there may be reasons to start data collection using semi-automated methods, and indications of how to do that are provided.
Related papers
- Agentic AI framework for End-to-End Medical Data Inference [5.871161259593687]
We introduce an Agentic AI framework that automates the entire clinical data pipeline, from ingestion to inference. We evaluate the system on publicly available datasets from geriatrics, palliative care, and colonoscopy imaging.
arXiv Detail & Related papers (2025-07-24T05:56:25Z)
- An Ensemble Scheme for Proactive Dominant Data Migration of Pervasive Tasks at the Edge [5.4327243200369555]
We propose a scheme to be implemented by autonomous edge nodes concerning their identifications of the appropriate data to be migrated to particular locations within the infrastructure.
Our objective is to equip nodes with the capability to comprehend the access patterns relating to offloaded data-driven tasks.
It is evident that these tasks depend on the processing of data that is absent from the original hosting nodes.
To infer these data intervals, we utilize an ensemble approach that integrates a statistically oriented model and a machine learning framework.
arXiv Detail & Related papers (2024-10-12T19:09:16Z)
- HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models [16.567468717846676]
HoneyBee is a scalable modular framework for building multimodal oncology datasets.
It generates embeddings that capture the essential features and relationships within the raw medical data.
HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.
arXiv Detail & Related papers (2024-05-13T04:35:14Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
- Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection [8.12993269922936]
We argue that data collection for AI should be performed in a responsible manner.
We propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics.
arXiv Detail & Related papers (2023-08-22T18:01:27Z)
- Surgical Phase and Instrument Recognition: How to identify appropriate Dataset Splits [2.045596350476764]
This work presents a publicly available data visualization tool that enables interactive exploration of dataset splits.
It focuses on the visualization of the occurrence of phases, phase transitions, instruments, and instrument combinations across sets.
Results: We performed an analysis of common Cholec80 dataset splits and were able to uncover phase transitions and combinations of instruments that were not represented in one of the sets.
arXiv Detail & Related papers (2023-06-29T12:02:16Z)
- Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR).
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability.
arXiv Detail & Related papers (2023-06-12T17:56:01Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals [4.799783526620609]
We released a catalogue of datasets and benchmarks pertaining to the broad domain of clinical and biomedical natural language processing (NLP).
A total of 450 NLP datasets were manually systematized and annotated with rich metadata.
Our analysis indicates that AI benchmarks of direct clinical relevance are scarce and fail to cover most work activities that clinicians want to see addressed.
arXiv Detail & Related papers (2022-01-18T15:05:28Z)
- Data Collection and Labeling of Real-Time IoT-Enabled Bio-Signals in Everyday Settings for Mental Health Improvement [6.7377504888630675]
Real-time physiological data collection and analysis play a central role in modern well-being applications.
This paper builds a system for the real-time collection and analysis of photoplethysmogram, acceleration, gyroscope, and gravity data from a wearable sensor.
arXiv Detail & Related papers (2021-08-02T20:56:48Z)
- DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and lack of relevant data, for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)