SynDelay: A Synthetic Dataset for Delivery Delay Prediction
- URL: http://arxiv.org/abs/2509.05325v1
- Date: Sat, 30 Aug 2025 21:54:37 GMT
- Title: SynDelay: A Synthetic Dataset for Delivery Delay Prediction
- Authors: Liming Xu, Yunbo Long, Alexandra Brintrup
- Abstract summary: We present SynDelay, a synthetic dataset designed for delivery delay prediction. It is publicly available through the Supply Chain Data Hub, an open initiative promoting dataset sharing and benchmarking in supply chain AI.
- Score: 50.56729406793283
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial intelligence (AI) is transforming supply chain management, yet progress in predictive tasks -- such as delivery delay prediction -- remains constrained by the scarcity of high-quality, openly available datasets. Existing datasets are often proprietary, small, or inconsistently maintained, hindering reproducibility and benchmarking. We present SynDelay, a synthetic dataset designed for delivery delay prediction. Generated using an advanced generative model trained on real-world data, SynDelay preserves realistic delivery patterns while ensuring privacy. Although not entirely free of noise or inconsistencies, it provides a challenging and practical testbed for advancing predictive modelling. To support adoption, we provide baseline results and evaluation metrics as initial benchmarks, serving as reference points rather than state-of-the-art claims. SynDelay is publicly available through the Supply Chain Data Hub, an open initiative promoting dataset sharing and benchmarking in supply chain AI. We encourage the community to contribute datasets, models, and evaluation practices to advance research in this area. All code is openly accessible at https://supplychaindatahub.org.
Related papers
- TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting [42.2854432715079]
We present TempoPFN, a time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths.
arXiv Detail & Related papers (2025-10-29T13:27:18Z)
- Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
We first reproduce and publicly release the synthetic data proposed by Wang et al. We critically examine where exactly synthetic data improves model generalization. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
arXiv Detail & Related papers (2025-09-07T19:28:52Z)
- Using Imperfect Synthetic Data in Downstream Inference Tasks [50.40949503799331]
We introduce a new estimator based on generalized method of moments. We find that interactions between the moment residuals of synthetic data and those of real data can improve estimates of the target parameter.
arXiv Detail & Related papers (2025-08-08T18:32:52Z)
- Synthetic-Powered Predictive Inference [28.99972786873634]
Synthetic-powered predictive inference (SPI) uses an empirical quantile mapping that aligns nonconformity scores from trusted, real data with those from synthetic data. Experiments on image classification -- augmenting data with synthetic diffusion-model generated images -- demonstrate notable improvements in predictive efficiency in data-scarce settings.
arXiv Detail & Related papers (2025-05-19T17:55:56Z)
- Tackling Data Heterogeneity in Federated Time Series Forecasting [61.021413959988216]
Time series forecasting plays a critical role in various real-world applications, including energy consumption prediction, disease transmission monitoring, and weather forecasting.
Most existing methods rely on a centralized training paradigm, where large amounts of data are collected from distributed devices to a central cloud server.
We propose a novel framework, Fed-TREND, to address data heterogeneity by generating informative synthetic data as auxiliary knowledge carriers.
arXiv Detail & Related papers (2024-11-24T04:56:45Z)
- Generating Accurate Synthetic Survival Data by Conditioning on Outcomes [16.401141867387324]
Synthetically generated data can improve privacy, fairness, and data accessibility. One key challenge in this setting is censoring, i.e., the timing of an event is unknown in some cases. Existing methods struggle to accurately reproduce the distributions of both observed and censored event times when generating synthetic data.
arXiv Detail & Related papers (2024-05-27T16:34:18Z)
- Are Synthetic Time-series Data Really not as Good as Real Data? [29.852306720544224]
Time-series data presents limitations stemming from data quality issues, bias and vulnerabilities, and generalization problems.
We introduce InfoBoost -- a highly versatile cross-domain data synthesizing framework with time series representation learning capability.
We have developed a method based on synthetic data that enables model training without the need for real data, surpassing the performance of models trained with real data.
arXiv Detail & Related papers (2024-02-01T13:59:04Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Synthcity: facilitating innovative use cases of synthetic data in different data modalities [86.52703093858631]
Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation.
Synthcity provides the practitioners with a single access point to cutting edge research and tools in synthetic data.
arXiv Detail & Related papers (2023-01-18T14:49:54Z)
- SynBench: Task-Agnostic Benchmarking of Pretrained Representations using Synthetic Data [78.21197488065177]
Recent success in fine-tuning large models, that are pretrained on broad data at scale, on downstream tasks has led to a significant paradigm shift in deep learning.
This paper proposes a new task-agnostic framework, SynBench, to measure the quality of pretrained representations using synthetic data.
arXiv Detail & Related papers (2022-10-06T15:25:00Z)
- Evolving GANs: When Contradictions Turn into Compliance [11.353579556329962]
We propose a GAN game which provides improved discriminator accuracy under limited data settings, while generating realistic synthetic data.
This provides the added advantage that now the generated data can be used for other similar tasks.
arXiv Detail & Related papers (2021-06-18T06:51:35Z)
- STAN: Synthetic Network Traffic Generation with Generative Neural Models [10.54843182184416]
This paper presents STAN (Synthetic network Traffic generation with Autoregressive Neural models), a tool to generate realistic synthetic network traffic datasets.
Our novel neural architecture captures both temporal dependencies and dependence between attributes at any given time.
We evaluate the performance of STAN in terms of the quality of data generated, by training it on both a simulated dataset and a real network traffic data set.
arXiv Detail & Related papers (2020-09-27T04:20:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.