tsrobprep -- an R package for robust preprocessing of time series data
- URL: http://arxiv.org/abs/2104.12657v1
- Date: Mon, 26 Apr 2021 15:35:11 GMT
- Title: tsrobprep -- an R package for robust preprocessing of time series data
- Authors: Michał Narajewski, Jens Kley-Holsteg, Florian Ziel
- Abstract summary: The open source package tsrobprep introduces efficient methods for handling missing values and outliers.
For data imputation a probabilistic replacement model is proposed, which may consist of autoregressive components and external inputs.
For outlier detection a clustering algorithm based on finite mixture modelling is introduced, which considers typical time series related properties as features.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Data cleaning is a crucial part of every data analysis exercise. Yet, the
currently available R packages do not provide fast and robust methods for
cleaning and preparing time series data. The open source package tsrobprep
introduces efficient methods for handling missing values and outliers using
model based approaches. For data imputation a probabilistic replacement model
is proposed, which may consist of autoregressive components and external
inputs. For outlier detection a clustering algorithm based on finite mixture
modelling is introduced, which considers typical time series related properties
as features. By assigning to each observation a probability of being an
outlying data point, the degree of outlyingness can be determined. The methods
are robust and fully tunable. Moreover, the auto_data_cleaning function carries
out the data preprocessing in a single call, without manual tuning, while still
providing suitable results. The primary motivation of the package is the
preprocessing of energy system data; however, the package is also suited for
other moderate and large sized time series data sets. We present applications
to electricity load, wind and solar power data.
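The package itself is written in R, but its two core ideas — a model-based replacement for missing values and a mixture-model score for the degree of outlyingness — can be illustrated with a self-contained sketch. The AR(1) imputer and the two-component Gaussian mixture below are simplified, hypothetical stand-ins chosen for illustration in Python; they are not the tsrobprep implementation.

```python
import math

def impute_ar1(series, phi=0.9):
    """Fill missing values (None) with a simple AR(1) prediction from
    the last observed value -- a toy stand-in for the probabilistic
    replacement model described in the abstract."""
    out, last = [], None
    for x in series:
        if x is None:
            x = phi * last if last is not None else 0.0
        out.append(x)
        last = x
    return out

def outlier_probabilities(series, n_iter=50):
    """Score each point with the posterior probability of belonging to
    a wide 'outlier' component in a two-component Gaussian mixture
    fitted by EM -- a toy analogue of assigning each observation a
    probability of being an outlying data point."""
    n = len(series)
    mu = sum(series) / n
    var = sum((x - mu) ** 2 for x in series) / n
    # component 0: narrow inlier component, component 1: wide outlier component
    m, v, w = [mu, mu], [var * 0.5 + 1e-9, var * 5 + 1e-9], [0.9, 0.1]

    def pdf(x, m_, v_):
        return math.exp(-((x - m_) ** 2) / (2 * v_)) / math.sqrt(2 * math.pi * v_)

    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point
        resp = []
        for x in series:
            p = [w[k] * pdf(x, m[k], v[k]) for k in range(2)]
            s = (p[0] + p[1]) or 1e-300
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, and variances
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            w[k] = nk / n
            m[k] = sum(r[k] * x for r, x in zip(resp, series)) / nk
            v[k] = sum(r[k] * (x - m[k]) ** 2 for r, x in zip(resp, series)) / nk + 1e-9
    return [r[1] for r in resp]
```

Running `outlier_probabilities` on a series with one extreme value assigns that point a posterior near 1 and the rest near 0, which mirrors how a degree of outlyingness can then be thresholded or handled downstream.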
Related papers
- RPS: A Generic Reservoir Patterns Sampler [1.09784964592609]
We introduce an approach that harnesses a weighted reservoir to facilitate direct pattern sampling from streaming batch data.
We present a generic algorithm capable of addressing temporal biases and handling various pattern types, including sequential, weighted, and unweighted itemsets.
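The paper's own sampler is not reproduced here, but the classical weighted-reservoir technique it builds on (Efraimidis-Spirakis "A-Res") can be sketched in a few lines; the items and weights below are illustrative.

```python
import heapq
import random

def weighted_reservoir_sample(stream, k, rng=random):
    """A-Res weighted reservoir sampling: keep the k items with the
    largest key u**(1/w), u ~ Uniform(0,1). Each (item, weight) pair
    is seen exactly once, so this works on unbounded streams."""
    heap = []  # min-heap of (key, item); the smallest key is evicted first
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

Items with larger weights get keys closer to 1 and are therefore more likely to survive in the reservoir, which is what makes weighted pattern sampling from a stream possible in one pass.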
arXiv Detail & Related papers (2024-10-31T16:25:21Z)
- A Language Model-Guided Framework for Mining Time Series with Distributional Shifts [5.082311792764403]
This paper presents an approach that utilizes large language models and data source interfaces to explore and collect time series datasets.
While obtained from external sources, the collected data share critical statistical properties with primary time series datasets.
It suggests that collected datasets can effectively supplement existing datasets, especially involving changes in data distribution.
arXiv Detail & Related papers (2024-06-07T20:21:07Z) - Chronos: Learning the Language of Time Series [79.38691251254173]
Chronos is a framework for pretrained probabilistic time series models.
We show that Chronos models can leverage time series data from diverse domains to improve zero-shot accuracy on unseen forecasting tasks.
arXiv Detail & Related papers (2024-03-12T16:53:54Z) - Probabilistic Modeling for Sequences of Sets in Continuous-Time [14.423456635520084]
We develop a general framework for modeling set-valued data in continuous-time.
We also develop inference methods that can use such models to answer probabilistic queries.
arXiv Detail & Related papers (2023-12-22T20:16:10Z) - Stable Training of Probabilistic Models Using the Leave-One-Out Maximum Log-Likelihood Objective [0.7373617024876725]
Kernel density estimation (KDE) based models are popular choices for this task, but they fail to adapt to data regions with varying densities.
An adaptive KDE model is employed to circumvent this, where each kernel in the model has an individual bandwidth.
A modified expectation-maximization algorithm is employed to accelerate the optimization speed reliably.
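The underlying idea — a Gaussian KDE in which each kernel carries its own bandwidth, scored with a leave-one-out log-likelihood so that shrinking a bandwidth to zero is no longer optimal — can be sketched as a toy objective function. This is an illustrative Python implementation under those assumptions, not the paper's training algorithm.

```python
import math

def loo_log_likelihood(data, bandwidths):
    """Leave-one-out log-likelihood of a 1-D Gaussian KDE in which the
    kernel centred on data[i] has its own bandwidth bandwidths[i].
    Evaluating each point on the density *without* its own kernel
    removes the degenerate maximum at zero bandwidth."""
    n = len(data)
    total = 0.0
    for i, x in enumerate(data):
        dens = 0.0
        for j, (c, h) in enumerate(zip(data, bandwidths)):
            if j == i:
                continue  # leave the point's own kernel out
            dens += math.exp(-(((x - c) / h) ** 2) / 2) / (h * math.sqrt(2 * math.pi))
        total += math.log(dens / (n - 1) + 1e-300)
    return total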
arXiv Detail & Related papers (2023-10-05T14:08:42Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Learning to be a Statistician: Learned Estimator for Number of Distinct
Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
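The learned estimator itself cannot be reconstructed from the abstract, but a classical sample-based baseline such as the Chao1 estimator illustrates the task of estimating NDV from a random sample; it is named here only as a standard point of comparison, not as the paper's method.

```python
from collections import Counter

def chao1_ndv(sample):
    """Chao1 lower-bound estimate of the number of distinct values:
    d + f1^2 / (2*f2), where d is the distinct values observed and
    f1/f2 count the values appearing exactly once/twice in the sample.
    Uses the bias-corrected variant when f2 == 0."""
    freq = Counter(Counter(sample).values())  # f_k: values seen exactly k times
    d = sum(freq.values())
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    if f2 == 0:
        return d + f1 * (f1 - 1) / 2.0
    return d + f1 * f1 / (2.0 * f2)
```

Many singletons relative to doubletons signal that the sample has missed a large share of the column's values, so the estimate grows accordingly.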
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- Time-Series Imputation with Wasserstein Interpolation for Optimal Look-Ahead-Bias and Variance Tradeoff [66.59869239999459]
In finance, imputation of missing returns may be applied prior to training a portfolio optimization model.
There is an inherent trade-off between the look-ahead-bias of using the full data set for imputation and the larger variance in the imputation from using only the training data.
We propose a Bayesian posterior consensus distribution which optimally controls the variance and look-ahead-bias trade-off in the imputation.
arXiv Detail & Related papers (2021-02-25T09:05:35Z)
- Learning summary features of time series for likelihood free inference [93.08098361687722]
We present a data-driven strategy for automatically learning summary features from time series data.
Our results indicate that learning summary features from data can compete and even outperform LFI methods based on hand-crafted values.
arXiv Detail & Related papers (2020-12-04T19:21:37Z)
- Time series forecasting with Gaussian Processes needs priors [1.5877673959068452]
We propose an optimal kernel and reliable estimation of the hyperparameters.
We present results on many time series of different types; our GP model is more accurate than state-of-the-art time series models.
arXiv Detail & Related papers (2020-09-17T06:46:51Z)
- PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming [65.88506015656951]
We present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data.
PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis.
arXiv Detail & Related papers (2020-07-23T08:01:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of this list (including all information) and is not responsible for any consequences.