TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural
Language Processing
- URL: http://arxiv.org/abs/2203.10839v2
- Date: Wed, 3 Aug 2022 03:18:00 GMT
- Title: TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural
Language Processing
- Authors: Mucheng Ren, Heyan Huang, Yuxiang Zhou, Qianwen Cao, Yuan Bu, Yang Gao
- Abstract summary: We focus on the core task of the TCM diagnosis and treatment system -- syndrome differentiation (SD).
Our dataset contains 54,152 real-world clinical records covering 148 syndromes.
We propose a domain-specific pre-trained language model, called ZY-BERT.
- Score: 31.190757020836656
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Traditional Chinese Medicine (TCM) is a natural, safe, and effective therapy
that has spread and been applied worldwide. The unique TCM diagnosis and
treatment system requires a comprehensive analysis of a patient's symptoms
hidden in the clinical record written in free text. Prior studies have shown
that this system can be informationized and intelligentized with the aid of
artificial intelligence (AI) technology, such as natural language processing
(NLP). However, existing datasets are of neither sufficient quality nor quantity to
support the further development of data-driven AI technology in TCM. Therefore,
in this paper, we focus on the core task of the TCM diagnosis and treatment
system -- syndrome differentiation (SD) -- and we introduce the first public
large-scale dataset for SD, called TCM-SD. Our dataset contains 54,152
real-world clinical records covering 148 syndromes. Furthermore, we collect a
large-scale unlabelled textual corpus in the field of TCM and propose a
domain-specific pre-trained language model, called ZY-BERT. We conducted
experiments using deep neural networks to establish a strong performance
baseline, reveal various challenges in SD, and prove the potential of
domain-specific pre-trained language models. Our study and analysis reveal
opportunities for incorporating computer science and linguistics knowledge to
explore the empirical validity of TCM theories.
Related papers
- Diagnostic Reasoning in Natural Language: Computational Model and Application [68.47402386668846]
We investigate diagnostic abductive reasoning (DAR) in the context of language-grounded tasks (NL-DAR).
We propose a novel modeling framework for NL-DAR based on Pearl's structural causal models.
We use the resulting dataset to investigate the human decision-making process in NL-DAR.
arXiv Detail & Related papers (2024-09-09T06:55:37Z)
- A Survey of Artificial Intelligence in Gait-Based Neurodegenerative Disease Diagnosis [51.07114445705692]
Neurodegenerative diseases (NDs) traditionally require extensive healthcare resources and human effort for medical diagnosis and monitoring.
As a crucial disease-related motor symptom, human gait can be exploited to characterize different NDs.
The current advances in artificial intelligence (AI) models enable automatic gait analysis for NDs identification and classification.
arXiv Detail & Related papers (2024-05-21T06:44:40Z)
- AI Framework for Early Diagnosis of Coronary Artery Disease: An
Integration of Borderline SMOTE, Autoencoders and Convolutional Neural
Networks Approach [0.44998333629984877]
We develop a methodology for balancing and augmenting data for more accurate prediction when the data is imbalanced and the sample size is small.
The experimental results revealed that the average accuracy of our proposed method for CAD prediction was 95.36, higher than that of random forest (RF), decision tree (DT), support vector machine (SVM), logistic regression (LR), and artificial neural network (ANN).
arXiv Detail & Related papers (2023-08-29T14:33:38Z)
- Leveraging text data for causal inference using electronic health records [1.4182510510164876]
This paper presents a unified framework for leveraging text data to support causal inference with electronic health data.
We show how incorporating text data in a traditional matching analysis can help strengthen the validity of an estimated treatment effect.
We believe these methods have the potential to expand the scope of secondary analysis of clinical data to domains where structured EHR data is limited.
arXiv Detail & Related papers (2023-06-09T16:06:02Z)
- Advancing Italian Biomedical Information Extraction with
Transformers-based Models: Methodological Insights and Multicenter Practical
Application [0.27027468002793437]
Information Extraction can help clinical practitioners overcome the limitation by using automated text-mining pipelines.
We created the first Italian neuropsychiatric Named Entity Recognition dataset, PsyNIT, and used it to develop a Transformers-based model.
The lessons learned are: (i) the crucial role of a consistent annotation process and (ii) a fine-tuning strategy that combines classical methods with a "low-resource" approach.
arXiv Detail & Related papers (2023-06-08T16:15:46Z)
- Incomplete Multimodal Learning for Complex Brain Disorders Prediction [65.95783479249745]
We propose a new incomplete multimodal data integration approach that employs transformers and generative adversarial networks.
We apply our new method to predict cognitive degeneration and disease outcomes using the multimodal imaging genetic data from Alzheimer's Disease Neuroimaging Initiative cohort.
arXiv Detail & Related papers (2023-05-25T16:29:16Z)
- DR.BENCH: Diagnostic Reasoning Benchmark for Clinical Natural Language
Processing [5.022185333260402]
Diagnostic Reasoning Benchmarks, DR.BENCH, is a new benchmark for developing and evaluating cNLP models with clinical diagnostic reasoning ability.
DR.BENCH is the first clinical suite of tasks designed to be a natural language generation framework to evaluate pre-trained language models.
arXiv Detail & Related papers (2022-09-29T16:05:53Z)
- Towards Structuring Real-World Data at Scale: Deep Learning for
Extracting Key Oncology Information from Clinical Text with Patient-Level
Supervision [10.929271646369887]
The majority of detailed patient information in real-world data (RWD) is only consistently available in free-text clinical documents.
Traditional rule-based systems are vulnerable to the prevalent linguistic variations and ambiguities in clinical text.
We propose leveraging patient-level supervision from medical registries, which are often readily available and capture key patient information.
arXiv Detail & Related papers (2022-03-20T03:42:03Z)
- Cross-Modality Deep Feature Learning for Brain Tumor Segmentation [158.8192041981564]
This paper proposes a novel cross-modality deep feature learning framework to segment brain tumors from the multi-modality MRI data.
The core idea is to mine rich patterns across the multi-modality data to make up for the insufficient data scale.
Comprehensive experiments are conducted on the BraTS benchmarks, which show that the proposed cross-modality deep feature learning framework can effectively improve the brain tumor segmentation performance.
arXiv Detail & Related papers (2022-01-07T07:46:01Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results with the current 11 pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Predicting Clinical Diagnosis from Patients Electronic Health Records
Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem in the medical community.
We present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification.
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.