Related papers: Automating Data Science Pipelines with Tensor Completion

URL: http://arxiv.org/abs/2410.06408v1
Date: Tue, 8 Oct 2024 22:34:08 GMT
Title: Automating Data Science Pipelines with Tensor Completion
Authors: Shaan Pakala, Bryce Graw, Dawon Ahn, Tam Dinh, Mehnaz Tabassum Mahin, Vassilis Tsotras, Jia Chen, Evangelos E. Papalexakis
Abstract summary: We model data science pipelines as instances of tensor completion. The goal is to identify all missing entries of the tensor, corresponding to all combinations of variable values. We extensively evaluate existing and proposed methods in a number of datasets.
Score: 4.956678070210018
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hyperparameter optimization is an essential component of many data science pipelines and typically entails exhaustive, time- and resource-consuming computations to explore the combinatorial search space. Other key operations in data science pipelines exhibit the same properties. Important examples are neural architecture search, where the goal is to identify the best design choices for a neural network, and query cardinality estimation, where given different predicate values for a SQL query the goal is to estimate the size of the output. In this paper, we abstract away those essential components of data science pipelines and model them as instances of tensor completion, where each variable of the search space corresponds to one mode of the tensor, and the goal is to identify all missing entries of the tensor, corresponding to all combinations of variable values, starting from a very small sample of observed entries. To do so, we first conduct a thorough experimental evaluation of existing state-of-the-art tensor completion techniques and introduce domain-inspired adaptations (such as smoothness across the discretized variable space) and an ensemble technique that achieves state-of-the-art performance. We extensively evaluate existing and proposed methods on a number of datasets generated for (a) hyperparameter optimization for non-neural network models, (b) neural architecture search, and (c) variants of query cardinality estimation, demonstrating the effectiveness of tensor completion as a tool for automating data science pipelines. Furthermore, we release our generated datasets and code in order to provide benchmarks for future work on this topic.
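
The modelling idea in the abstract (one tensor mode per search-space variable, a few observed cells, all remaining cells predicted) can be illustrated with a minimal sketch. It assumes a plain CP (canonical polyadic) factorization fitted by stochastic gradient descent on the observed entries only; it does not reproduce the paper's ensemble or smoothness adaptations. The function names (`cp_predict`, `complete_tensor`), the 4 x 4 x 3 grid, and the synthetic scores are all hypothetical and chosen for illustration.

```python
import itertools
import random


def cp_predict(factors, index):
    """Evaluate a rank-R CP model at one multi-index."""
    rank = len(factors[0][0])
    total = 0.0
    for r in range(rank):
        term = 1.0
        for mode, i in enumerate(index):
            term *= factors[mode][i][r]
        total += term
    return total


def complete_tensor(shape, observed, rank=1, epochs=1000, lr=0.05, seed=0):
    """Fit CP factors to the observed entries by SGD, then fill every cell."""
    rng = random.Random(seed)
    # One factor matrix per mode (search-space variable), positive random init.
    factors = [[[rng.uniform(0.1, 1.0) for _ in range(rank)]
                for _ in range(size)] for size in shape]
    entries = list(observed.items())
    for _ in range(epochs):
        rng.shuffle(entries)
        for index, value in entries:
            err = cp_predict(factors, index) - value
            for r in range(rank):
                current = [factors[m][i][r] for m, i in enumerate(index)]
                for m, i in enumerate(index):
                    others = 1.0
                    for k, v in enumerate(current):
                        if k != m:
                            others *= v
                    # Gradient of 0.5 * err**2 with respect to this factor entry.
                    factors[m][i][r] -= lr * err * others
    cells = itertools.product(*(range(size) for size in shape))
    return {index: cp_predict(factors, index) for index in cells}


# Hypothetical 4 x 4 x 3 grid, e.g. learning rate x depth x regularisation.
# The "true" scores are synthetic (rank 1 by construction), not real results.
a, b, c = [0.6, 0.8, 1.0, 0.9], [0.7, 0.9, 1.0, 0.8], [0.8, 1.0, 0.9]
shape = (len(a), len(b), len(c))
truth = {(i, j, k): a[i] * b[j] * c[k]
         for i, j, k in itertools.product(*(range(n) for n in shape))}

# Pretend only half of the configurations were actually run.
observed = {idx: val for idx, val in truth.items() if sum(idx) % 2 == 0}
completed = complete_tensor(shape, observed)

held_out = [idx for idx in truth if idx not in observed]
mae = sum(abs(completed[i] - truth[i]) for i in held_out) / len(held_out)
print(f"observed {len(observed)} of {len(truth)} cells; held-out MAE = {mae:.4f}")
```

In a real pipeline the observed cells would come from actually running a small sample of configurations, and the completed tensor would be scanned for the best predicted cell. The rank and learning rate here are arbitrary choices for the toy data.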

Related papers

Task-Specific Data Selection for Instruction Tuning via Monosemantic Neuronal Activations [19.25205110583291]
A critical bottleneck is selecting the most relevant data to maximize task-specific performance. Existing data selection approaches include unstable influence-based methods and more stable distribution alignment methods. We introduce a dedicated similarity metric for this space to better identify task-relevant data.
arXiv Detail & Related papers (2025-03-19T11:35:57Z)
Image Classification using Combination of Topological Features and Neural Networks [1.0323063834827417]
We use the persistent homology method, a technique in topological data analysis (TDA), to extract essential topological features from the data space. This was carried out with the aim of classifying images from multiple classes in the MNIST dataset. Our approach inserts topological features into deep learning approaches composed of single- and two-stream neural networks.
arXiv Detail & Related papers (2023-11-10T20:05:40Z)
On Characterizing the Evolution of Embedding Space of Neural Networks using Algebraic Topology [9.537910170141467]
We study how the topology of feature embedding space changes as it passes through the layers of a well-trained deep neural network (DNN) through Betti numbers. We demonstrate that as depth increases, a topologically complicated dataset is transformed into a simple one, resulting in Betti numbers attaining their lowest possible value.
arXiv Detail & Related papers (2023-11-08T10:45:12Z)
Manifold Learning with Sparse Regularised Optimal Transport [1.949927790632678]
Real-world datasets are subject to noisy observations and sampling, so that distilling information about the underlying manifold is a major challenge. We propose a method for manifold learning that utilises a symmetric version of optimal transport with a quadratic regularisation. We prove that the resulting kernel is consistent with a Laplace-type operator in the continuous limit, establish robustness to heteroskedastic noise and exhibit these results in numerical experiments.
arXiv Detail & Related papers (2023-07-19T08:05:46Z)
Generating Synthetic Datasets by Interpolating along Generalized Geodesics [18.278734644369052]
We show how to combine datasets that can be synthesised as "combinations". In particular, we show how to interpolate even between datasets with distinct and unrelated label sets. We demonstrate this is a promising new approach for targeted on-demand dataset synthesis.
arXiv Detail & Related papers (2023-06-12T04:46:44Z)
Towards Personalized Preprocessing Pipeline Search [52.59156206880384]
ClusterP3S is a novel framework for Personalized Preprocessing Pipeline Search via Clustering. We propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines. Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
arXiv Detail & Related papers (2023-02-28T05:45:05Z)
HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models. We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
Rank-R FNN: A Tensor-Based Learning Model for High-Order Data Classification [69.26747803963907]
Rank-R Feedforward Neural Network (FNN) is a tensor-based nonlinear learning model that imposes Canonical/Polyadic decomposition on its parameters. First, it handles inputs as multilinear arrays, bypassing the need for vectorization, and can thus fully exploit the structural information along every data dimension. We establish the universal approximation and learnability properties of Rank-R FNN, and we validate its performance on real-world hyperspectral datasets.
arXiv Detail & Related papers (2021-04-11T16:37:32Z)
Learning from Incomplete Features by Simultaneous Training of Neural Networks and Sparse Coding [24.3769047873156]
This paper addresses the problem of training a classifier on a dataset with incomplete features. We assume that different subsets of features (random or structured) are available at each data instance. A new supervised learning method is developed to train a general classifier, using only a subset of features per sample.
arXiv Detail & Related papers (2020-11-28T02:20:39Z)
Deep Representational Similarity Learning for analyzing neural signatures in task-based fMRI dataset [81.02949933048332]
This paper develops Deep Representational Similarity Learning (DRSL), a deep extension of Representational Similarity Analysis (RSA). DRSL is appropriate for analyzing similarities between various cognitive tasks in fMRI datasets with a large number of subjects.
arXiv Detail & Related papers (2020-09-28T18:30:14Z)
Distributed Learning via Filtered Hyperinterpolation on Manifolds [2.2046162792653017]
This paper studies the problem of learning real-valued functions on manifolds. Motivated by the problem of handling large data sets, it presents a parallel data processing approach. We prove quantitative relations between the approximation quality of the learned function over the entire manifold.
arXiv Detail & Related papers (2020-07-18T10:05:18Z)
A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference. Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach. IFAQ treats the feature extraction query and the learning task as one program given in the IFAQ's domain-specific language. We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.