Predictive Modeling and Explainable AI for Veterinary Safety Profiles, Residue Assessment, and Health Outcomes Using Real-World Data and Physicochemical Properties
- URL: http://arxiv.org/abs/2510.01520v1
- Date: Wed, 01 Oct 2025 23:34:46 GMT
- Title: Predictive Modeling and Explainable AI for Veterinary Safety Profiles, Residue Assessment, and Health Outcomes Using Real-World Data and Physicochemical Properties
- Authors: Hossein Sholehrasa, Xuan Xu, Doina Caragea, Jim E. Riviere, Majid Jaberi-Douraki
- Abstract summary: Adverse events (AEs) may signal unexpected pharmacokinetic or toxicokinetic effects, increasing the risk of violative residues in the food chain. This study introduces a predictive framework for classifying outcomes (Death vs. Recovery) using 1.28 million reports from the U.S. FDA's OpenFDA Center for Veterinary Medicine.
- Score: 4.53318808068234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The safe use of pharmaceuticals in food-producing animals is vital to protect animal welfare and human food safety. Adverse events (AEs) may signal unexpected pharmacokinetic or toxicokinetic effects, increasing the risk of violative residues in the food chain. This study introduces a predictive framework for classifying outcomes (Death vs. Recovery) using ~1.28 million reports (1987-2025 Q1) from the U.S. FDA's OpenFDA Center for Veterinary Medicine. A preprocessing pipeline merged relational tables and standardized AEs through VeDDRA ontologies. Data were normalized, missing values imputed, and high-cardinality features reduced; physicochemical drug properties were integrated to capture chemical-residue links. We evaluated supervised models, including Random Forest, CatBoost, XGBoost, ExcelFormer, and large language models (Gemma 3-27B, Phi 3-12B). Class imbalance was addressed with techniques such as undersampling and oversampling, with a focus on prioritizing recall for fatal outcomes. Ensemble methods (Voting, Stacking) and CatBoost performed best, achieving precision, recall, and F1-scores of 0.95. Incorporating Average Uncertainty Margin (AUM)-based pseudo-labeling of uncertain cases improved minority-class detection, particularly in ExcelFormer and XGBoost. Interpretability via SHAP identified biologically plausible predictors, including lung, heart, and bronchial disorders, animal demographics, and drug physicochemical properties. These features were strongly linked to fatal outcomes. Overall, the framework shows that combining rigorous data engineering, advanced machine learning, and explainable AI enables accurate, interpretable predictions of veterinary safety outcomes. The approach supports FARAD's mission by enabling early detection of high-risk drug-event profiles, strengthening residue risk assessment, and informing regulatory and clinical decision-making.
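The abstract mentions oversampling and a focus on recall for fatal outcomes without giving details. The following is a minimal sketch of that general idea, not the authors' pipeline: it assumes a binary Death/Recovery label, uses plain random oversampling (the paper does not say which variant was used), and the helper names `random_oversample` and `precision_recall_f1` are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, minority, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes
    have the same count. Assumes a binary label; apply to the training
    split only, never to validation or test data."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    if not minority_idx:
        raise ValueError("no minority-class rows to sample from")
    majority_count = len(labels) - len(minority_idx)
    n_extra = max(0, majority_count - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data (illustrative only): 8 Recovery reports, 2 Death reports.
rows = list(range(10))
labels = ["Recovery"] * 8 + ["Death"] * 2
bal_rows, bal_labels = random_oversample(rows, labels, minority="Death")
balanced_counts = Counter(bal_labels)

# Score the fatal class explicitly, since that is the class of interest.
y_true = ["Death", "Death", "Recovery", "Recovery"]
y_pred = ["Death", "Recovery", "Recovery", "Death"]
metrics = precision_recall_f1(y_true, y_pred, positive="Death")
```

Reporting metrics with Death as the positive class keeps missed fatal outcomes visible, which overall accuracy on an imbalanced dataset tends to hide.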
Related papers
- Synthetic Data for Veterinary EHR De-identification: Benefits, Limits, and Safety Trade-offs Under Fixed Compute [0.0]
This study evaluates whether large language model (LLM)-generated synthetic narratives improve de-identification safety. We generated 10,382 synthetic notes using a privacy-preserving "template-only" regime. We conclude that synthetic augmentation is effective for expanding exposure but is complementary, not substitutive, for safety-critical veterinary de-identification.
arXiv Detail & Related papers (2026-01-13T19:35:25Z) - Fine-Tuning ChemBERTa for Predicting Inhibitory Activity Against TDP1 Using Deep Learning [0.0]
Predicting the potency of small molecules against Tyrosyl-DNA Phosphodiesterase 1 (TDP1) is a critical challenge in early drug discovery. We present a deep learning framework for the quantitative regression of pIC50 values using fine-tuned variants of ChemBERTa. Our approach outperforms classical baselines Random Predictor in both regression accuracy and virtual screening utility.
arXiv Detail & Related papers (2025-12-03T20:42:22Z) - Assessing the Feasibility of Early Cancer Detection Using Routine Laboratory Data: An Evaluation of Machine Learning Approaches on an Imbalanced Dataset [0.02030567625639093]
The development of accessible screening tools for early cancer detection in dogs represents a significant challenge in veterinary medicine. This study assesses the feasibility of cancer risk classification using the Golden Retriever Lifetime Study cohort under real-world constraints. It is concluded that while a statistically detectable cancer signal exists in routine lab data, it is too weak and confounded for clinically reliable discrimination from normal aging or other inflammatory conditions.
arXiv Detail & Related papers (2025-10-23T04:52:42Z) - Emergenet: A Digital Twin of Sequence Evolution for Scalable Emergence Risk Assessment of Animal Influenza A Strains [3.6744384899193405]
This study introduces Emergenet, a tool to infer a digital twin of sequence evolution to chart how new variants might emerge in the wild.
Our predictions based on Emergenets built only using 220,151 Hemagglutinin (HA) sequences consistently outperform WHO seasonal vaccine recommendations.
arXiv Detail & Related papers (2024-11-26T06:50:18Z) - Comprehensive Methodology for Sample Augmentation in EEG Biomarker Studies for Alzheimers Risk Classification [0.0]
Alzheimer's disease (AD), the leading type, accounts for 70% of cases. EEG measures show promise in identifying AD risk, but obtaining large samples for reliable comparisons is challenging. This study integrates signal processing, harmonization, and statistical techniques to enhance sample size and improve AD risk classification reliability.
arXiv Detail & Related papers (2024-11-20T10:31:02Z) - CRTRE: Causal Rule Generation with Target Trial Emulation Framework [47.2836994469923]
We introduce a novel method called causal rule generation with target trial emulation framework (CRTRE).
CRTRE applies randomized trial design principles to estimate the causal effect of association rules.
We then incorporate such association rules for the downstream applications such as prediction of disease onsets.
arXiv Detail & Related papers (2024-11-10T02:40:06Z) - Fuzzy Rule based Intelligent Cardiovascular Disease Prediction using Complex Event Processing [0.8668211481067458]
Cardiovascular diseases (CVDs) are a rapidly rising global concern due to unhealthy diets, lack of physical activity, and other factors.
Recent research has focused on accurate and timely disease prediction to reduce risk and fatalities.
We propose a fuzzy rule-based system for monitoring clinical data to provide real-time decision support.
arXiv Detail & Related papers (2024-09-19T16:36:24Z) - SSM-DTA: Breaking the Barriers of Data Scarcity in Drug-Target Affinity Prediction [127.43571146741984]
Drug-Target Affinity (DTA) is of vital importance in early-stage drug discovery.
Wet experiments remain the most reliable method, but they are time-consuming and resource-intensive.
Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue.
We present the SSM-DTA framework, which incorporates three simple yet highly effective strategies.
arXiv Detail & Related papers (2022-06-20T14:53:25Z) - Filter Drug-induced Liver Injury Literature with Natural Language Processing and Ensemble Learning [0.0]
Drug-induced liver injury (DILI) describes the adverse effects of drugs that damage the liver.
Life-threatening results including liver failure or death were also reported in severe DILI cases.
Data extraction from previous publications relies heavily on manual labelling.
Recent development of artificial intelligence enabled automatic processing of biomedical texts.
arXiv Detail & Related papers (2022-03-09T23:53:07Z) - DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations [90.27736364704108]
We present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery.
DrugOOD comes with an open-source Python package that fully automates benchmarking processes.
We focus on one of the most crucial problems in AIDD: drug target binding affinity prediction.
arXiv Detail & Related papers (2022-01-24T12:32:48Z) - Predicting Chemical Hazard across Taxa through Machine Learning [0.3262230127283452]
We analyze the relevance of taxonomy and experimental setup, and show that taking them into account can lead to considerable improvements in the classification performance.
We use our approach with standard machine learning models (K-nearest neighbors, random forests and deep neural networks), as well as the recently proposed Read-Across Structure Activity Relationship (RASAR) models.
arXiv Detail & Related papers (2021-10-07T15:33:58Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines, by up to 19% over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z) - Deep Learning for Virtual Screening: Five Reasons to Use ROC Cost Functions [80.12620331438052]
Deep learning has become an important tool for rapid screening of billions of molecules in silico for potential hits containing desired chemical features.
Despite its importance, substantial challenges persist in training these models, such as severe class imbalance, high decision thresholds, and lack of ground truth labels in some datasets.
We argue in favor of directly optimizing the receiver operating characteristic (ROC) in such cases, due to its robustness to class imbalance.
arXiv Detail & Related papers (2020-06-25T08:46:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.