Data Set Terminology of Deep Learning in Medicine: A Historical Review and Recommendation
- URL: http://arxiv.org/abs/2404.19303v2
- Date: Tue, 18 Jun 2024 09:49:49 GMT
- Title: Data Set Terminology of Deep Learning in Medicine: A Historical Review and Recommendation
- Authors: Shannon L. Walston, Hiroshi Seki, Hirotaka Takita, Yasuhito Mitsuyama, Shingo Sato, Akifumi Hagiwara, Rintaro Ito, Shouhei Hanaoka, Yukio Miki, Daiju Ueda
- Abstract summary: Medicine and deep learning-based artificial intelligence engineering represent two distinct fields each with decades of published history.
With such history comes a set of terminology that has a specific way in which it is applied.
This review aims to give historical context for these terms and to accentuate the importance of clarity when they are used in medical AI contexts.
- Score: 0.7897552065199818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medicine and deep learning-based artificial intelligence (AI) engineering represent two distinct fields each with decades of published history. With such history comes a set of terminology that has a specific way in which it is applied. However, when two distinct fields with overlapping terminology start to collaborate, miscommunication and misunderstandings can occur. This narrative review aims to give historical context for these terms, accentuate the importance of clarity when these terms are used in medical AI contexts, and offer solutions to mitigate misunderstandings by readers from either field. Through an examination of historical documents, including articles, writing guidelines, and textbooks, this review traces the divergent evolution of terms for data sets and their impact. Initially, the discordant interpretations of the word 'validation' in medical and AI contexts are explored. Then the data sets used for AI evaluation are classified, namely random splitting, cross-validation, temporal, geographic, internal, and external sets. The accurate and standardized description of these data sets is crucial for demonstrating the robustness and generalizability of AI applications in medicine. This review clarifies existing literature to provide a comprehensive understanding of these classifications and their implications in AI evaluation. This review then identifies often misunderstood terms and proposes pragmatic solutions to mitigate terminological confusion. Among these solutions are the use of standardized terminology such as 'training set,' 'validation (or tuning) set,' and 'test set,' and explicit definition of data set splitting terminologies in each medical AI research publication. This review aspires to enhance the precision of communication in medical AI, thereby fostering more effective and transparent research methodologies in this interdisciplinary field.
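The abstract names the recommended sets ('training set', 'validation (or tuning) set', 'test set') and distinguishes random splitting from temporal sets, but gives no procedure. The following is a minimal illustrative sketch, not taken from the review: the record fields (`patient_id`, `year`), the split fractions, and the function names are all hypothetical assumptions chosen for the example.

```python
import random


def random_split(records, frac_train=0.7, frac_val=0.15, seed=0):
    """Shuffle records and cut them into training, validation (tuning), and test sets.

    The fractions are illustrative assumptions, not values from the review.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * frac_train)
    n_val = round(len(shuffled) * frac_val)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, val, test


def temporal_split(records, cutoff_year):
    """Hold out every record collected in or after cutoff_year as a temporal test set."""
    development = [r for r in records if r["year"] < cutoff_year]
    temporal_test = [r for r in records if r["year"] >= cutoff_year]
    return development, temporal_test


# Hypothetical data: 100 patients, one record each, collected 2015-2019.
records = [{"patient_id": i, "year": 2015 + i % 5} for i in range(100)]

train, val, test = random_split(records)
development, temporal_test = temporal_split(records, cutoff_year=2019)
```

In this sketch the validation (tuning) set is used only for model selection, while the test set stays untouched until final evaluation. Real medical data often has several records per patient, in which case the split would need to be done at the patient level to avoid leakage between sets.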
Related papers
- Leveraging Ontologies to Document Bias in Data [1.0635248457021496]
Doc-BiasO is a resource that aims to create an integrated vocabulary of biases defined in the fair-ML literature and their measures.
Our main objective is to contribute towards clarifying existing terminology on bias research as it rapidly expands to all areas of AI.
arXiv Detail & Related papers (2024-06-29T18:41:07Z)
- Situated Ground Truths: Enhancing Bias-Aware AI by Situating Data Labels with SituAnnotate [0.1843404256219181]
SituAnnotate is a novel ontology-based approach to structured and context-aware data annotation.
It aims to anchor the ground truth data employed in training AI systems within the contextual and culturally-bound situations.
As a method to create, query, and compare label-based datasets, SituAnnotate empowers downstream AI systems to undergo training with explicit consideration of context and cultural bias.
arXiv Detail & Related papers (2024-06-10T09:33:13Z)
- AI Hallucinations: A Misnomer Worth Clarifying [4.880243880711163]
We present and analyze definitions obtained across all databases, categorize them based on their applications, and extract key points within each category.
Our results highlight a lack of consistency in how the term is used, but also help identify several alternative terms in the literature.
arXiv Detail & Related papers (2024-01-09T01:49:41Z)
- Biomedical Named Entity Recognition via Dictionary-based Synonym Generalization [51.89486520806639]
We propose a novel Synonym Generalization (SynGen) framework that recognizes the biomedical concepts contained in the input text using span-based predictions.
We extensively evaluate our approach on a wide range of benchmarks and the results verify that SynGen outperforms previous dictionary-based models by notable margins.
arXiv Detail & Related papers (2023-05-22T14:36:32Z)
- "Nothing Abnormal": Disambiguating Medical Reports via Contrastive Knowledge Infusion [6.9551174393701345]
We propose a rewriting algorithm based on contrastive pretraining and perturbation-based rewriting.
We create two datasets, OpenI-Annotated based on chest reports and VA-Annotated based on general medical reports.
Our proposed algorithm effectively rewrites input sentences in a less ambiguous way with high content fidelity.
arXiv Detail & Related papers (2023-05-15T02:01:20Z)
- EBOCA: Evidences for BiOmedical Concepts Association Ontology [55.41644538483948]
This paper proposes EBOCA, an ontology that describes (i) biomedical domain concepts and associations between them, and (ii) evidences supporting these associations.
Test data coming from a subset of DISNET and automatic association extractions from texts has been transformed to create a Knowledge Graph that can be used in real scenarios.
arXiv Detail & Related papers (2022-08-01T18:47:03Z)
- Clinical Named Entity Recognition using Contextualized Token Representations [49.036805795072645]
This paper introduces the technique of contextualized word embedding to better capture the semantic meaning of each word based on its context.
We pre-train two deep contextualized language models, Clinical Embeddings from Language Model (C-ELMo) and Clinical Contextual String Embeddings (C-Flair).
Explicit experiments show that our models gain dramatic improvements compared to both static word embeddings and domain-generic language models.
arXiv Detail & Related papers (2021-06-23T18:12:58Z)
- Semi-Supervised Variational Reasoning for Medical Dialogue Generation [70.838542865384]
Two key characteristics are relevant for medical dialogue generation: patient states and physician actions.
We propose an end-to-end variational reasoning approach to medical dialogue generation.
A physician policy network composed of an action-classifier and two reasoning detectors is proposed for augmented reasoning ability.
arXiv Detail & Related papers (2021-05-13T04:14:35Z)
- Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
arXiv Detail & Related papers (2021-02-15T18:46:44Z)
- Unifying Relational Sentence Generation and Retrieval for Medical Image Report Composition [142.42920413017163]
Current methods often generate the most common sentences for individual cases because of dataset bias.
We propose a novel framework that unifies template retrieval and sentence generation to handle both common and rare abnormality.
arXiv Detail & Related papers (2021-01-09T04:33:27Z)
- On the Combined Use of Extrinsic Semantic Resources for Medical Information Search [0.0]
We develop a framework to highlight and expand head medical concepts in verbose medical queries.
We also build semantically enhanced inverted index documents.
To demonstrate the effectiveness of the proposed approach, we conducted several experiments over the CLEF 2014 dataset.
arXiv Detail & Related papers (2020-05-17T14:18:04Z)