PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus
- URL: http://arxiv.org/abs/2505.20323v1
- Date: Fri, 23 May 2025 18:01:09 GMT
- Title: PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus
- Authors: Shahriar Noroozizadeh, Sayantan Kumar, George H. Chen, Jeremy C. Weiss
- Abstract summary: We present PMOA-TTS, the first openly available dataset of 124,699 annotated PubMed Open Access case reports. Our approach combines filtering with Llama 3.3 to identify single-patient case reports, followed by prompt-driven extraction. To assess timeline quality, we evaluate against a clinician-curated reference set using three metrics.
- Score: 9.924632472835551
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding temporal dynamics in clinical narratives is essential for modeling patient trajectories, yet large-scale temporally annotated resources remain limited. We present PMOA-TTS, the first openly available dataset of 124,699 PubMed Open Access (PMOA) case reports, each converted into structured (event, time) timelines via a scalable LLM-based pipeline. Our approach combines heuristic filtering with Llama 3.3 to identify single-patient case reports, followed by prompt-driven extraction using Llama 3.3 and DeepSeek R1, resulting in over 5.6 million timestamped clinical events. To assess timeline quality, we evaluate against a clinician-curated reference set using three metrics: (i) event-level matching (80% match at a cosine similarity threshold of 0.1), (ii) temporal concordance (c-index > 0.90), and (iii) Area Under the Log-Time CDF (AULTC) for timestamp alignment. Corpus-level analysis shows wide diagnostic and demographic coverage. In a downstream survival prediction task, embeddings from extracted timelines achieve time-dependent concordance indices up to 0.82 $\pm$ 0.01, demonstrating the predictive value of temporally structured narratives. PMOA-TTS provides a scalable foundation for timeline extraction, temporal reasoning, and longitudinal modeling in biomedical NLP. The dataset is available at: https://huggingface.co/datasets/snoroozi/pmoa-tts .
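As a rough illustration of the temporal concordance metric mentioned in the abstract, the sketch below computes a pairwise concordance index between reference and extracted timestamps. It assumes that events have already been matched one-to-one and that times are numeric offsets (for example, hours relative to admission). The function name `temporal_concordance`, the tie handling, and the sample values are hypothetical and are not taken from the paper's code.

```python
from itertools import combinations


def temporal_concordance(reference_times, predicted_times):
    """Fraction of comparable event pairs whose temporal order agrees.

    Pairs tied in the reference are skipped as non-comparable; pairs tied
    only in the prediction count as half-concordant (a common c-index
    convention, assumed here rather than taken from the paper).
    """
    if len(reference_times) != len(predicted_times):
        raise ValueError("inputs must have the same length")
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(reference_times)), 2):
        ref_diff = reference_times[i] - reference_times[j]
        if ref_diff == 0:
            continue  # reference gives no ordering for this pair
        comparable += 1
        pred_diff = predicted_times[i] - predicted_times[j]
        if pred_diff == 0:
            concordant += 0.5
        elif (ref_diff > 0) == (pred_diff > 0):
            concordant += 1.0
    return concordant / comparable if comparable else float("nan")


if __name__ == "__main__":
    # Hypothetical timestamps in hours relative to admission.
    reference = [-24.0, 0.0, 0.0, 12.0, 48.0]
    predicted = [-48.0, 0.0, 2.0, 30.0, 24.0]
    print(temporal_concordance(reference, predicted))
```

A value of 1.0 means every comparable pair of events is ordered the same way in both timelines, 0.5 corresponds to chance-level ordering, and 0.0 to a fully reversed timeline.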
Related papers
- SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology [1.5425688173297465]
SigBERT is an innovative temporal survival analysis framework designed to process a large number of clinical reports per patient. It processes timestamped medical reports by extracting and averaging word embeddings into sentence embeddings. It was trained and evaluated on a real-world oncology dataset from the Léon Bérard Center corpus, with a C-index score of 0.75 (sd 0.014) on the independent test cohort.
arXiv Detail & Related papers (2025-07-25T12:33:25Z) - Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder [8.239710313549466]
This study integrates pause features with semantic coherence metrics across three datasets. Key findings demonstrate that pause features alone robustly predict the severity of formal thought disorder (FTD). These findings suggest that frameworks combining temporal and semantic analyses provide a roadmap for refining the assessment of disorganized speech.
arXiv Detail & Related papers (2025-07-17T22:00:16Z) - A theoretical framework for self-supervised contrastive learning for continuous dependent data [86.50780641055258]
Self-supervised learning (SSL) has emerged as a powerful approach to learning representations, particularly in the field of computer vision. We propose a novel theoretical framework for contrastive SSL tailored to semantic independence between samples. Specifically, we outperform TS2Vec on the standard UEA and UCR benchmarks, with accuracy improvements of 4.17% and 2.08%, respectively.
arXiv Detail & Related papers (2025-06-11T14:23:47Z) - A Large-Language Model Framework for Relative Timeline Extraction from PubMed Case Reports [10.869574822060553]
We present a system that transforms case reports into textual time series: structured pairs of textual events and timestamps. This work may serve as a benchmark for leveraging the PMOA corpus for temporal analytics.
arXiv Detail & Related papers (2025-04-15T20:54:19Z) - Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis [7.734726150561087]
Clinical case reports and discharge summaries may be the most complete and accurate summarization of patient encounters, yet they are finalized, i.e., timestamped after the encounter. We construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models.
arXiv Detail & Related papers (2025-04-12T03:07:44Z) - Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR).
arXiv Detail & Related papers (2024-05-30T03:15:09Z) - Knowledge Enhanced Conditional Imputation for Healthcare Time-series [9.937117045677923]
Conditional Self-Attention Imputation (CSAI) is a novel recurrent neural network architecture designed to address the challenges of complex missing data patterns.
CSAI extends the current state-of-the-art neural network-based imputation methods by introducing key modifications specifically adapted to EHR data characteristics.
This work significantly advances the state of neural network imputation applied to EHRs by more closely aligning algorithmic imputation with clinical realities.
arXiv Detail & Related papers (2023-12-27T20:42:40Z) - Temporal Supervised Contrastive Learning for Modeling Patient Risk Progression [12.185263022907744]
We propose a supervised contrastive learning framework that learns an embedding representation for each time step of a patient time series.
Our framework learns the embedding space to have the following properties: (1) nearby points in the embedding space have similar predicted class probabilities, (2) adjacent time steps of the same time series map to nearby points in the embedding space, and (3) time steps with very different raw feature vectors map to far apart regions of the embedding space.
arXiv Detail & Related papers (2023-12-10T16:43:15Z) - Clairvoyance: A Pipeline Toolkit for Medical Time Series [95.22483029602921]
Time-series learning is the bread and butter of data-driven *clinical decision support*.
Clairvoyance proposes a unified, end-to-end, autoML-friendly pipeline that serves as a software toolkit.
Clairvoyance is the first to demonstrate viability of a comprehensive and automatable pipeline for clinical time-series ML.
arXiv Detail & Related papers (2023-10-28T12:08:03Z) - T-Phenotype: Discovering Phenotypes of Predictive Temporal Patterns in Disease Progression [82.85825388788567]
We develop a novel temporal clustering method, T-Phenotype, to discover phenotypes of predictive temporal patterns from labeled time-series data.
We show that T-Phenotype achieves the best phenotype discovery performance over all the evaluated baselines.
arXiv Detail & Related papers (2023-02-24T13:30:35Z) - Shorter Latency of Real-time Epileptic Seizure Detection via Probabilistic Prediction [6.480989310008518]
We propose a novel deep learning framework intended for shortening epileptic seizure detection latency via probabilistic prediction.
We implement the proposed framework on two prevalent datasets -- CHB-MIT scalp EEG dataset and SWEC-ETHZ intracranial EEG dataset.
The obtained detection latencies are at least 50% shorter than state-of-the-art results reported in previous studies.
arXiv Detail & Related papers (2023-01-04T08:45:47Z) - Clinical Temporal Relation Extraction with Probabilistic Soft Logic Regularization and Global Inference [50.029659413650194]
Existing methods either require expensive feature engineering or are incapable of modeling the global dependencies among the events.
In this paper, we propose a novel method, Clinical Temporal ReLation Extraction with Probabilistic Soft Logic Regularization and Global Inference.
arXiv Detail & Related papers (2020-12-16T08:23:03Z) - Learning summary features of time series for likelihood free inference [93.08098361687722]
We present a data-driven strategy for automatically learning summary features from time series data.
Our results indicate that learning summary features from data can compete with and even outperform LFI methods based on hand-crafted values.
arXiv Detail & Related papers (2020-12-04T19:21:37Z) - Trajectories, bifurcations and pseudotime in large clinical datasets: applications to myocardial infarction and diabetes data [94.37521840642141]
We suggest a semi-supervised methodology for the analysis of large clinical datasets, characterized by mixed data types and missing values.
The methodology is based on application of elastic principal graphs which can address simultaneously the tasks of dimensionality reduction, data visualization, clustering, feature selection and quantifying the geodesic distances (pseudotime) in partially ordered sequences of observations.
arXiv Detail & Related papers (2020-07-07T21:04:55Z)