HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity
- URL: http://arxiv.org/abs/2505.14725v1
- Date: Mon, 19 May 2025 19:37:49 GMT
- Title: HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity
- Authors: Xuejun Sun, Yiran Song, Xiaochen Zhou, Ruilie Cai, Yu Zhang, Xinyi Li, Rui Peng, Jialiu Xie, Yuanyuan Yan, Muyao Tang, Prem Lakshmanane, Baiming Zou, James S. Hagood, Raymond J. Pickles, Didong Li, Fei Zou, Xiaojing Zheng
- Abstract summary: The Human Respiratory Viral Immunization LongitudinAl Gene Expression (HR-VILAGE-3K3M) repository integrates 14,136 RNA-seq profiles from 3,178 subjects across 66 studies encompassing over 2.56 million cells. HR-VILAGE-3K3M is the largest longitudinal transcriptomic resource for human respiratory viral immunization.
- Score: 8.64940622146001
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Respiratory viral infections pose a global health burden, yet the cellular immune responses driving protection or pathology remain unclear. Natural infection cohorts often lack pre-exposure baseline data and structured temporal sampling. In contrast, inoculation and vaccination trials generate insightful longitudinal transcriptomic data. However, the scattering of these datasets across platforms, along with inconsistent metadata and preprocessing procedures, hinders AI-driven discovery. To address these challenges, we developed the Human Respiratory Viral Immunization LongitudinAl Gene Expression (HR-VILAGE-3K3M) repository: an AI-ready, rigorously curated dataset that integrates 14,136 RNA-seq profiles from 3,178 subjects across 66 studies encompassing over 2.56 million cells. Spanning vaccination, inoculation, and mixed exposures, the dataset includes microarray, bulk RNA-seq, and single-cell RNA-seq from whole blood, PBMCs, and nasal swabs, sourced from GEO, ImmPort, and ArrayExpress. We harmonized subject-level metadata, standardized outcome measures, applied unified preprocessing pipelines with rigorous quality control, and aligned all data to official gene symbols. To demonstrate the utility of HR-VILAGE-3K3M, we performed predictive modeling of vaccine responders and evaluated batch-effect correction methods. Beyond these initial demonstrations, it supports diverse systems immunology applications and benchmarking of feature selection and transfer learning algorithms. Its scale and heterogeneity also make it ideal for pretraining foundation models of the human immune response and for advancing multimodal learning frameworks. As the largest longitudinal transcriptomic resource for human respiratory viral immunization, it provides an accessible platform for reproducible AI-driven research, accelerating systems immunology and vaccine development against emerging viral threats.
Related papers
- Harnessing Preference Optimisation in Protein LMs for Hit Maturation in Cell Therapy [0.5315454965484603]
Cell and immunotherapy offer transformative potential for treating diseases like cancer and autoimmune disorders by modulating the immune system. The development of these therapies is resource-intensive, with the majority of drug candidates failing to progress beyond laboratory testing. Although recent advances in machine learning have revolutionised areas such as protein engineering, applications in immunotherapy remain limited due to the scarcity of large-scale, standardised datasets and the complexity of cellular systems.
arXiv Detail & Related papers (2024-12-02T11:21:58Z)
- Machine Learning-Based Analysis of Ebola Virus' Impact on Gene Expression in Nonhuman Primates [3.842863644161241]
This study introduces the Supervised Magnitude-Altitude Scoring (SMAS) methodology, a machine learning-based approach for analyzing gene expression data obtained from nonhuman primates (NHPs) infected with Ebola virus (EBOV).
We utilize a comprehensive dataset of NanoString gene expression profiles from Ebola-infected NHPs, deploying the SMAS system for nuanced host-pathogen interaction analysis.
A key finding of our research is the identification of IFI6 and IFI27 as critical biomarkers, demonstrating exceptional predictive performance with 100% accuracy and Area Under the Curve (AUC) metrics in classifying various stages of
arXiv Detail & Related papers (2024-01-16T18:31:23Z)
- Agent-Based Model: Simulating a Virus Expansion Based on the Acceptance of Containment Measures [65.62256987706128]
Compartmental epidemiological models categorize individuals based on their disease status.
We propose an ABM architecture that combines an adapted SEIRD model with a decision-making model for citizens.
We illustrate the designed model by examining the progression of SARS-CoV-2 infections in A Coruna, Spain.
arXiv Detail & Related papers (2023-07-28T08:01:05Z)
- Scalable Pathogen Detection from Next Generation DNA Sequencing with Deep Learning [3.8175773487333857]
We propose MG2Vec, a deep learning-based solution that uses the transformer network as its backbone.
We show that the proposed approach can help detect pathogens from uncurated, real-world clinical samples.
We provide a comprehensive evaluation of a novel representation learning framework for metagenome-based disease diagnostics with deep learning.
arXiv Detail & Related papers (2022-11-30T00:13:59Z)
- Using Deep Learning Sequence Models to Identify SARS-CoV-2 Divergence [1.9573380763700707]
SARS-CoV-2 is an upper respiratory system RNA virus that had caused over 3 million deaths and infected over 150 million people worldwide as of May 2021.
We propose a neural network model that leverages recurrent and convolutional units to take in amino acid sequences of spike proteins and classify corresponding clades.
arXiv Detail & Related papers (2021-11-12T07:52:11Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- COVIDx-US -- An open-access benchmark dataset of ultrasound imaging data for AI-driven COVID-19 analytics [116.6248556979572]
COVIDx-US is an open-access benchmark dataset of COVID-19 related ultrasound imaging data.
It consists of 93 lung ultrasound videos and 10,774 processed images of patients infected with SARS-CoV-2 pneumonia, non-SARS-CoV-2 pneumonia, as well as healthy control cases.
arXiv Detail & Related papers (2021-03-18T03:31:33Z)
- Semi-supervised Neural Networks solve an inverse problem for modeling Covid-19 spread [61.9008166652035]
We study the spread of COVID-19 using a semi-supervised neural network.
We assume a passive part of the population remains isolated from the virus dynamics.
arXiv Detail & Related papers (2020-10-10T19:33:53Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta-learning techniques to develop a new model that can extract common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, which is a simple yet effective meta-learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
- MutaGAN: A Seq2seq GAN Framework to Predict Mutations of Evolving Protein Populations [0.0]
Influenza virus sequences were identified as an ideal test case for this deep learning framework.
MutaGAN generated "child" sequences from a given "parent" protein sequence with a median Levenshtein distance of 2.00 amino acids.
Results demonstrate the power of the MutaGAN framework to aid in pathogen forecasting with implications for broad utility in evolutionary prediction for any protein population.
arXiv Detail & Related papers (2020-08-26T20:20:30Z)
- Hemogram Data as a Tool for Decision-making in COVID-19 Management: Applications to Resource Scarcity Scenarios [62.997667081978825]
The COVID-19 pandemic has challenged emergency response systems worldwide, with widespread reports of breakdowns in essential services and the collapse of health care structures.
This work describes a machine learning model derived from hemogram exam data collected from symptomatic patients.
The proposed models can predict COVID-19 qRT-PCR results in symptomatic individuals with high accuracy, sensitivity and specificity.
arXiv Detail & Related papers (2020-05-10T01:45:03Z)