tsrobprep -- an R package for robust preprocessing of time series data
- URL: http://arxiv.org/abs/2104.12657v1
- Date: Mon, 26 Apr 2021 15:35:11 GMT
- Title: tsrobprep -- an R package for robust preprocessing of time series data
- Authors: Michał Narajewski, Jens Kley-Holsteg, Florian Ziel
- Abstract summary: The open source package tsrobprep introduces efficient methods for handling missing values and outliers.
For data imputation a probabilistic replacement model is proposed, which may consist of autoregressive components and external inputs.
For outlier detection a clustering algorithm based on finite mixture modelling is introduced, which considers typical time series related properties as features.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Data cleaning is a crucial part of every data analysis exercise. Yet, the
currently available R packages do not provide fast and robust methods for
cleaning and preparing time series data. The open source package tsrobprep
introduces efficient methods for handling missing values and outliers using
model based approaches. For data imputation a probabilistic replacement model
is proposed, which may consist of autoregressive components and external
inputs. For outlier detection a clustering algorithm based on finite mixture
modelling is introduced, which considers typical time series related properties
as features. By assigning to each observation a probability of being an
outlying data point, the degree of outlyingness can be determined. The methods
are robust and fully tunable. Moreover, the auto_data_cleaning function carries
out the data preprocessing in a single call, without manual tuning, while still
providing suitable results. The primary motivation of the package is the
preprocessing of energy system data; however, the package is also suited for
other moderate and large sized time series data sets. We present applications
to electricity load, wind and solar power data.
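The package itself is written in R, but its two core ideas — a model-based replacement for missing values and a mixture-model score for the degree of outlyingness — can be illustrated with a self-contained sketch. The AR(1) imputer and the two-component Gaussian mixture below are simplified, hypothetical stand-ins chosen for illustration in Python; they are not the tsrobprep implementation.

```python
import math

def impute_ar1(series, phi=0.9):
    """Fill missing values (None) with a simple AR(1) prediction from
    the last observed value -- a toy stand-in for the probabilistic
    replacement model described in the abstract."""
    out, last = [], None
    for x in series:
        if x is None:
            x = phi * last if last is not None else 0.0
        out.append(x)
        last = x
    return out

def outlier_probabilities(series, n_iter=50):
    """Score each point with the posterior probability of belonging to
    a wide 'outlier' component in a two-component Gaussian mixture
    fitted by EM -- a toy analogue of assigning each observation a
    probability of being an outlying data point."""
    n = len(series)
    mu = sum(series) / n
    var = sum((x - mu) ** 2 for x in series) / n
    # component 0: narrow inlier component, component 1: wide outlier component
    m, v, w = [mu, mu], [var * 0.5 + 1e-9, var * 5 + 1e-9], [0.9, 0.1]

    def pdf(x, m_, v_):
        return math.exp(-((x - m_) ** 2) / (2 * v_)) / math.sqrt(2 * math.pi * v_)

    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point
        resp = []
        for x in series:
            p = [w[k] * pdf(x, m[k], v[k]) for k in range(2)]
            s = (p[0] + p[1]) or 1e-300
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, and variances
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            w[k] = nk / n
            m[k] = sum(r[k] * x for r, x in zip(resp, series)) / nk
            v[k] = sum(r[k] * (x - m[k]) ** 2 for r, x in zip(resp, series)) / nk + 1e-9
    return [r[1] for r in resp]
```

Running `outlier_probabilities` on a series with one extreme value assigns that point a posterior near 1 and the rest near 0, which mirrors how a degree of outlyingness can then be thresholded or handled downstream.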
Related papers
- RPS: A Generic Reservoir Patterns Sampler [1.09784964592609]
We introduce an approach that harnesses a weighted reservoir to facilitate direct pattern sampling from streaming batch data.
We present a generic algorithm capable of addressing temporal biases and handling various pattern types, including sequential, weighted, and unweighted itemsets.
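The paper's own sampler is not reproduced here, but the classical weighted-reservoir technique it builds on (Efraimidis-Spirakis "A-Res") can be sketched in a few lines; the items and weights below are illustrative.

```python
import heapq
import random

def weighted_reservoir_sample(stream, k, rng=random):
    """A-Res weighted reservoir sampling: keep the k items with the
    largest key u**(1/w), u ~ Uniform(0,1). Each (item, weight) pair
    is seen exactly once, so this works on unbounded streams."""
    heap = []  # min-heap of (key, item); the smallest key is evicted first
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

Items with larger weights get keys closer to 1 and are therefore more likely to survive in the reservoir, which is what makes weighted pattern sampling from a stream possible in one pass.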
arXiv Detail & Related papers (2024-10-31T16:25:21Z)
- A Language Model-Guided Framework for Mining Time Series with Distributional Shifts [5.082311792764403]
This paper presents an approach that utilizes large language models and data source interfaces to explore and collect time series datasets.
While obtained from external sources, the collected data share critical statistical properties with primary time series datasets.
It suggests that collected datasets can effectively supplement existing datasets, especially involving changes in data distribution.
arXiv Detail & Related papers (2024-06-07T20:21:07Z) - Chronos: Learning the Language of Time Series [79.38691251254173]
Chronos is a framework for pretrained probabilistic time series models.
We show that Chronos models can leverage time series data from diverse domains to improve zero-shot accuracy on unseen forecasting tasks.
arXiv Detail & Related papers (2024-03-12T16:53:54Z) - Probabilistic Modeling for Sequences of Sets in Continuous-Time [14.423456635520084]
We develop a general framework for modeling set-valued data in continuous-time.
We also develop inference methods that can use such models to answer probabilistic queries.
arXiv Detail & Related papers (2023-12-22T20:16:10Z) - Stable Training of Probabilistic Models Using the Leave-One-Out Maximum Log-Likelihood Objective [0.7373617024876725]
Kernel density estimation (KDE) based models are popular choices for this task, but they fail to adapt to data regions with varying densities.
An adaptive KDE model is employed to circumvent this, where each kernel in the model has an individual bandwidth.
A modified expectation-maximization algorithm is employed to accelerate the optimization speed reliably.
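The underlying idea — a Gaussian KDE in which each kernel carries its own bandwidth, scored with a leave-one-out log-likelihood so that shrinking a bandwidth to zero is no longer optimal — can be sketched as a toy objective function. This is an illustrative Python implementation under those assumptions, not the paper's training algorithm.

```python
import math

def loo_log_likelihood(data, bandwidths):
    """Leave-one-out log-likelihood of a 1-D Gaussian KDE in which the
    kernel centred on data[i] has its own bandwidth bandwidths[i].
    Evaluating each point on the density *without* its own kernel
    removes the degenerate maximum at zero bandwidth."""
    n = len(data)
    total = 0.0
    for i, x in enumerate(data):
        dens = 0.0
        for j, (c, h) in enumerate(zip(data, bandwidths)):
            if j == i:
                continue  # leave the point's own kernel out
            dens += math.exp(-(((x - c) / h) ** 2) / 2) / (h * math.sqrt(2 * math.pi))
        total += math.log(dens / (n - 1) + 1e-300)
    return total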
arXiv Detail & Related papers (2023-10-05T14:08:42Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Learning to be a Statistician: Learned Estimator for Number of Distinct
Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
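The learned estimator itself cannot be reconstructed from the abstract, but a classical sample-based baseline such as the Chao1 estimator illustrates the task of estimating NDV from a random sample; it is named here only as a standard point of comparison, not as the paper's method.

```python
from collections import Counter

def chao1_ndv(sample):
    """Chao1 lower-bound estimate of the number of distinct values:
    d + f1^2 / (2*f2), where d is the distinct values observed and
    f1/f2 count the values appearing exactly once/twice in the sample.
    Uses the bias-corrected variant when f2 == 0."""
    freq = Counter(Counter(sample).values())  # f_k: values seen exactly k times
    d = sum(freq.values())
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    if f2 == 0:
        return d + f1 * (f1 - 1) / 2.0
    return d + f1 * f1 / (2.0 * f2)
```

Many singletons relative to doubletons signal that the sample has missed a large share of the column's values, so the estimate grows accordingly.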
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- Time-Series Imputation with Wasserstein Interpolation for Optimal Look-Ahead-Bias and Variance Tradeoff [66.59869239999459]
In finance, imputation of missing returns may be applied prior to training a portfolio optimization model.
There is an inherent trade-off between the look-ahead-bias of using the full data set for imputation and the larger variance in the imputation from using only the training data.
We propose a Bayesian posterior consensus distribution which optimally controls the variance and look-ahead-bias trade-off in the imputation.
arXiv Detail & Related papers (2021-02-25T09:05:35Z)
- Learning summary features of time series for likelihood free inference [93.08098361687722]
We present a data-driven strategy for automatically learning summary features from time series data.
Our results indicate that learning summary features from data can compete and even outperform LFI methods based on hand-crafted values.
arXiv Detail & Related papers (2020-12-04T19:21:37Z)
- Time series forecasting with Gaussian Processes needs priors [1.5877673959068452]
We propose an optimal kernel and reliable estimation of the hyperparameters.
We present results on many time series of different types; our GP model is more accurate than state-of-the-art time series models.
arXiv Detail & Related papers (2020-09-17T06:46:51Z)
- PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming [65.88506015656951]
We present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data.
PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis.
arXiv Detail & Related papers (2020-07-23T08:01:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of this list (including all information) and is not responsible for any consequences.