A Dataset of Dengue Hospitalizations in Brazil (1999 to 2021) with Weekly Disaggregation from Monthly Counts
- URL: http://arxiv.org/abs/2601.16994v1
- Date: Mon, 12 Jan 2026 20:27:57 GMT
- Title: A Dataset of Dengue Hospitalizations in Brazil (1999 to 2021) with Weekly Disaggregation from Monthly Counts
- Authors: Lucas M. Morello, Matheus Lima Castro, Pedro Cesar M. G. Camargo, Liliane Moreira Nery, Darllan Collins da Cunha e Silva, Leopoldo Lusquino Filho
- Abstract summary: This data paper describes and publicly releases this dataset (v1.0.0), published on Zenodo under DOI 10.5281/zenodo.18189192. Motivated by the need to increase the temporal granularity of originally monthly data to enable more effective training of AI models for epidemiological forecasting, the dataset harmonizes municipal-level dengue hospitalization time series across Brazil and disaggregates them to weekly resolution (epidemiological weeks) through a protocol with a correction step that preserves monthly totals.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This data paper describes and publicly releases this dataset (v1.0.0), published on Zenodo under DOI 10.5281/zenodo.18189192. Motivated by the need to increase the temporal granularity of originally monthly data to enable more effective training of AI models for epidemiological forecasting, the dataset harmonizes municipal-level dengue hospitalization time series across Brazil and disaggregates them to weekly resolution (epidemiological weeks) through an interpolation protocol with a correction step that preserves monthly totals. The statistical and temporal validity of this disaggregation was assessed using a high-resolution reference dataset from the state of São Paulo (2024), which simultaneously provides monthly and epidemiological-week counts, enabling a direct comparison of three strategies: linear interpolation, jittering, and cubic spline. Results indicated that cubic spline interpolation achieved the highest adherence to the reference data, and this strategy was therefore adopted to generate weekly series for the 1999 to 2021 period. In addition to hospitalization time series, the dataset includes a comprehensive set of explanatory variables commonly used in epidemiological and environmental modeling, such as demographic density, CH4, CO2, and NO2 emissions, poverty and urbanization indices, maximum temperature, mean monthly precipitation, minimum relative humidity, and municipal latitude and longitude, following the same temporal disaggregation scheme to ensure multivariate compatibility. The paper documents the dataset's provenance, structure, formats, licenses, limitations, and quality metrics (MAE, RMSE, R², KL, JSD, DTW, and the KS test), and provides usage recommendations for multivariate time-series analysis, environmental health studies, and the development of machine learning and deep learning models for outbreak forecasting.
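The idea of a disaggregation with a correction step that preserves monthly totals can be illustrated with a minimal sketch. This is not the authors' protocol: it swaps in piecewise-linear interpolation (one of the three compared strategies) instead of the adopted cubic spline so that it runs on the Python standard library alone, it uses consecutive 7-day blocks as a stand-in for epidemiological weeks, and the function names and the monthly counts are hypothetical.

```python
from datetime import date, timedelta


def month_days(year, month):
    """Return every calendar day of the given month."""
    first = date(year, month, 1)
    nxt = date(year + month // 12, month % 12 + 1, 1)
    return [first + timedelta(days=i) for i in range((nxt - first).days)]


def interp(x, anchors):
    """Piecewise-linear interpolation through (x, y) anchors, clamped at the ends."""
    if x <= anchors[0][0]:
        return anchors[0][1]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return anchors[-1][1]


def disaggregate(monthly):
    """Spread monthly counts over days.

    monthly: list of ((year, month), count) in chronological order.
    Returns a dict mapping each day to a fractional count.
    """
    # Anchor each month's mean daily rate at the middle of that month.
    anchors = []
    for (y, m), count in monthly:
        days = month_days(y, m)
        anchors.append((days[len(days) // 2].toordinal(), count / len(days)))

    daily = {}
    for (y, m), count in monthly:
        days = month_days(y, m)
        raw = [max(interp(d.toordinal(), anchors), 0.0) for d in days]
        total = sum(raw)
        # Correction step: rescale so the days of the month sum to the monthly count.
        if total > 0:
            scaled = [v * count / total for v in raw]
        else:
            scaled = [count / len(days)] * len(days)
        daily.update(zip(days, scaled))
    return daily


def to_weeks(daily):
    """Sum daily values into consecutive 7-day blocks (the last block may be shorter)."""
    days = sorted(daily)
    return [
        (days[i], sum(daily[d] for d in days[i:i + 7]))
        for i in range(0, len(days), 7)
    ]


# Made-up monthly counts for one municipality (assumption, not from the dataset).
monthly = [((2020, 1), 31), ((2020, 2), 58), ((2020, 3), 93)]
weekly = to_weeks(disaggregate(monthly))
for start, value in weekly:
    print(start.isoformat(), round(value, 2))
```

Rescaling at daily resolution before aggregating means that weeks straddling two months inherit a share from each month, while every month's days still sum to the original monthly count. Replacing `interp` with a cubic spline would change only the shape of the curve inside each month, not the preserved totals.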
Related papers
- Independent Component Discovery in Temporal Count Data [46.526610368455096]
We introduce a generative framework for independent component analysis of temporal count data, combining regime-adaptive dynamics with Poisson log-normal emissions. The model identifies disentangled components with regime-dependent contributions, enabling representation learning and perturbations analysis.
arXiv Detail & Related papers (2026-01-29T13:30:10Z)
- Multi-modal Adaptive Estimation for Temporal Respiratory Disease Outbreak [8.861941883057098]
MAESTRO is a novel, unified framework that integrates spectro-temporal modeling with multi-modal data fusion. Evaluated on over 11 years of Hong Kong data, MAESTRO achieves a superior model fit with an R-square of 0.956. The modular and reproducible pipeline is made publicly available to facilitate deployment and extension to other regions and pathogens.
arXiv Detail & Related papers (2025-09-10T13:27:40Z)
- Multitask LSTM for Arboviral Outbreak Prediction Using Public Health Data [0.0]
This paper presents a multitask learning approach for the joint prediction of arboviral outbreaks and case counts in Recife, Brazil. The proposed model concurrently performs binary classification (outbreak detection) and regression (case forecasting) tasks. The architecture delivers competitive performance across diseases and tasks, demonstrating the feasibility and advantages of unified modeling strategies for scalable epidemic forecasting in data-limited public health scenarios.
arXiv Detail & Related papers (2025-05-07T16:58:18Z)
- Targeted Data Fusion for Causal Survival Analysis Under Distribution Shift [46.84912148188679]
Causal inference across multiple data sources offers a promising avenue to enhance the generalizability and replicability of scientific findings. Existing approaches fail to address the unique challenges of survival analysis, such as censoring and the integration of discrete and continuous time. We propose two novel methods for estimating target site-specific causal effects in multi-source settings.
arXiv Detail & Related papers (2025-01-30T23:21:25Z)
- Grouped self-attention mechanism for a memory-efficient Transformer [64.0125322353281]
Real-world tasks such as forecasting weather, electricity consumption, and stock market involve predicting data that vary over time.
Time-series data are generally recorded over a long period of observation with long sequences owing to their periodic characteristics and long-range dependencies over time.
We propose two novel modules, Grouped Self-Attention (GSA) and Compressed Cross-Attention (CCA).
Our proposed model exhibited reduced computational complexity and performance comparable to or better than existing methods.
arXiv Detail & Related papers (2022-10-02T06:58:49Z)
- Discrepancies in Epidemiological Modeling of Aggregated Heterogeneous Data [1.433758865948252]
We show that state-of-the-art models for estimating epidemiological parameters, e.g., transmission rates, can be inappropriate when faced with complex systems.
We generate three complex outbreak scenarios by combining incidence curves from multiple epidemics.
We evaluate two data-generating models within this Bayesian inference framework.
arXiv Detail & Related papers (2021-06-20T03:41:19Z)
- Modeling the geospatial evolution of COVID-19 using spatio-temporal convolutional sequence-to-sequence neural networks [48.7576911714538]
Portugal was the country in the world with the largest incidence rate, with 14-day incidence rates per 100,000 inhabitants in excess of 1000.
Despite its importance, accurate prediction of the geospatial evolution of COVID-19 remains a challenge.
arXiv Detail & Related papers (2021-05-06T15:24:00Z)
- Comparison of Traditional and Hybrid Time Series Models for Forecasting COVID-19 Cases [0.5849513679510832]
The coronavirus outbreak of December 2019 has already infected millions all over the world and continues to spread.
Just when the curve of the outbreak had started to flatten, many countries have again started to witness a rise in cases.
A thorough analysis of time-series forecasting models is therefore required to equip state authorities and health officials with immediate strategies for future times.
arXiv Detail & Related papers (2021-05-05T14:56:27Z)
- STELAR: Spatio-temporal Tensor Factorization with Latent Epidemiological Regularization [76.57716281104938]
We develop a tensor method to predict the evolution of epidemic trends for many regions simultaneously.
STELAR enables long-term prediction by incorporating latent temporal regularization through a system of discrete-time difference equations.
We conduct experiments using both county- and state-level COVID-19 data and show that our model can identify interesting latent patterns of the epidemic.
arXiv Detail & Related papers (2020-12-08T21:21:47Z)
- Ensemble Forecasting of the Zika Space-Time Spread with Topological Data Analysis [13.838100337224075]
Zika virus is primarily transmitted through bites of infected mosquitoes of the species Aedes aegypti and Aedes albopictus.
The abundance of mosquitoes and, as a result, the prevalence of Zika virus infections are common in areas which have high precipitation, high temperature, and high population density.
We introduce a new concept of cumulative Betti numbers and then integrate the cumulative Betti numbers as topological descriptors into three machine learning models.
arXiv Detail & Related papers (2020-09-24T16:42:19Z)
- Temporal Phenotyping using Deep Predictive Clustering of Disease Progression [97.88605060346455]
We develop a deep learning approach for clustering time-series data, where each cluster comprises patients who share similar future outcomes of interest.
Experiments on two real-world datasets show that our model achieves superior clustering performance over state-of-the-art benchmarks.
arXiv Detail & Related papers (2020-06-15T20:48:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.