Numeric Encoding Options with Automunge
- URL: http://arxiv.org/abs/2202.09496v2
- Date: Tue, 22 Feb 2022 21:38:02 GMT
- Title: Numeric Encoding Options with Automunge
- Authors: Nicholas J. Teague
- Abstract summary: This paper will offer arguments for potential benefits of extended encodings of numeric streams in deep learning.
Proposals are based on options for numeric transformations available in the Automunge open source Python library platform.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mainstream practice in machine learning with tabular data may take for
granted that any feature engineering beyond scaling for numeric sets is
superfluous in context of deep neural networks. This paper will offer arguments
for potential benefits of extended encodings of numeric streams in deep
learning by way of a survey of options for numeric transformations as available
in the Automunge open source Python library platform for tabular data
pipelines, where transformations may be applied to distinct columns in "family
tree" sets with generations and branches of derivations. Automunge
transformation options include normalization, binning, noise injection,
derivatives, and more. The aggregation of these methods into family tree sets
of transformations is demonstrated as a means to present numeric features to
machine learning in multiple configurations of varying information content, as
may be applied to encode numeric sets of unknown interpretation. Experiments
demonstrate the realization of a novel generalized solution to data
augmentation by noise injection for tabular learning, as may materially benefit
model performance in applications with underserved training data.
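To make the family tree and noise injection ideas concrete, a minimal sketch follows. This is plain NumPy/pandas, not the Automunge interface (the library itself is applied through its automunge(.) and postmunge(.) calls); the function and column suffix names are illustrative only. One numeric column is derived into several parallel encodings, and training rows are then augmented by injecting Gaussian noise into a random subset of the normalized copies.

```python
# Illustrative sketch only: NOT the Automunge API. It mimics two ideas from
# the abstract -- presenting one numeric column in several encodings
# ("family tree" branches) and augmenting training data by noise injection.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def encode_branches(col: pd.Series, n_bins: int = 5) -> pd.DataFrame:
    """Derive several parallel encodings from one numeric feature."""
    out = pd.DataFrame(index=col.index)
    out[f"{col.name}_nmbr"] = (col - col.mean()) / col.std()                # z-score
    out[f"{col.name}_mnmx"] = (col - col.min()) / (col.max() - col.min())  # min-max
    bins = pd.qcut(col, q=n_bins, labels=False, duplicates="drop")
    onehot = pd.get_dummies(bins, prefix=f"{col.name}_bins")  # quantile binning
    return pd.concat([out, onehot], axis=1)

def inject_noise(encoded: pd.DataFrame, scale: float = 0.03,
                 ratio: float = 0.5) -> pd.DataFrame:
    """Gaussian noise on a random subset of rows of the scaled columns,
    a generic stand-in for noise-injection augmentation."""
    noisy = encoded.copy()
    numeric_cols = [c for c in noisy if c.endswith(("_nmbr", "_mnmx"))]
    mask = rng.random(len(noisy)) < ratio              # rows chosen for injection
    noise = rng.normal(0.0, scale, (mask.sum(), len(numeric_cols)))
    noisy.loc[mask, numeric_cols] += noise
    return noisy

df = pd.DataFrame({"x": rng.normal(10.0, 2.0, 200)})
encoded = encode_branches(df["x"])
augmented = pd.concat([encoded, inject_noise(encoded)])  # doubled training rows
```

Stacking the original and noise-injected copies doubles the effective training rows while leaving validation and test data untouched, the usual discipline for augmentation.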
Related papers
- TabulaX: Leveraging Large Language Models for Multi-Class Table Transformations (arXiv, 2024-11-26)
In this paper, we introduce TabulaX, a novel framework that leverages Large Language Models (LLMs) for multi-class table transformations.
We show that TabulaX outperforms existing state-of-the-art approaches in terms of accuracy, supports a broader class of transformations, and generates interpretable transformations that can be efficiently applied.
- Deep Feature Embedding for Tabular Data (arXiv, 2024-08-30)
This paper proposes a novel deep embedding framework that leverages lightweight deep neural networks.
For numerical features, a two-step feature expansion and deep transformation technique is used to capture copious semantic information.
Experiments are conducted on real-world datasets for performance evaluation.
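The summary only names the two steps; one plausible reading, with every name here hypothetical rather than drawn from the paper, expands each scalar into a richer vector and then passes it through a lightweight network:

```python
# Hypothetical reading of "feature expansion + deep transformation":
# expand each scalar into a richer vector, then project it with a tiny MLP.
import numpy as np

def expand(x: np.ndarray, centers: np.ndarray, width: float = 1.0) -> np.ndarray:
    """Step 1: expansion -- raw value plus RBF responses to fixed centers."""
    rbf = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width**2))
    return np.concatenate([x[:, None], rbf], axis=1)   # shape (n, 1 + n_centers)

def deep_transform(z: np.ndarray, w1, b1, w2, b2) -> np.ndarray:
    """Step 2: lightweight two-layer network producing the embedding."""
    h = np.maximum(z @ w1 + b1, 0.0)                   # ReLU hidden layer
    return h @ w2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=64)
centers = np.linspace(-3, 3, 8)
z = expand(x, centers)                                 # (64, 9) expanded features
emb = deep_transform(z, rng.normal(size=(9, 16)), np.zeros(16),
                     rng.normal(size=(16, 4)), np.zeros(4))  # (64, 4) embeddings
```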
- Making Pre-trained Language Models Great on Tabular Prediction (arXiv, 2024-03-04)
The transferability of deep neural networks (DNNs) has driven significant progress in image and language processing.
We present TP-BERTa, a specifically pre-trained LM for tabular data prediction.
A novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names.
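The tokenization step can be pictured as quantizing each scalar into one of many fine-grained magnitude bins and looking up a token embedding. The sketch below is a generic quantize-and-embed illustration under that reading, not TP-BERTa's actual procedure:

```python
# Generic quantize-and-embed sketch of turning scalars into discrete tokens.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 256, 32
token_table = rng.normal(size=(n_tokens, d_model))   # learned in the real model

def magnitude_tokenize(x: np.ndarray) -> np.ndarray:
    """Map each scalar to a fine-grained bin id in [0, n_tokens)."""
    lo, hi = x.min(), x.max()
    scaled = (x - lo) / (hi - lo + 1e-12)            # relative magnitude in [0, 1]
    return np.minimum((scaled * n_tokens).astype(int), n_tokens - 1)

values = rng.normal(10, 3, size=100)
tokens = magnitude_tokenize(values)                  # discrete token ids
embeddings = token_table[tokens]                     # (100, 32) token vectors
```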
- A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning (arXiv, 2023-11-10)
Data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones.
Existing benchmarks for tabular feature selection consider classical downstream models, toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance.
We construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers.
We also propose an input-gradient-based analogue of Lasso for neural networks that outperforms classical feature selection methods on challenging problems.
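One way to read the input-gradient analogue of Lasso, sketched here under assumptions rather than from the paper's definition: score each feature by the mean absolute gradient of the network output with respect to that input, then keep the top-k.

```python
# Hedged sketch: rank features by mean |d f / d x_j| over the data, then
# keep the top-k. The benchmark's actual method may differ in detail.
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 512, 20, 32
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(h, d)) / np.sqrt(d), np.zeros(h)
w2 = rng.normal(size=h) / np.sqrt(h)

def input_gradients(X: np.ndarray) -> np.ndarray:
    """Analytic d f / d x for f(x) = w2 . relu(W1 x + b1), per sample."""
    pre = X @ W1.T + b1                    # (n, h) pre-activations
    active = (pre > 0).astype(float)       # ReLU derivative
    return (active * w2) @ W1              # (n, d) per-sample input gradients

scores = np.abs(input_gradients(X)).mean(axis=0)   # saliency per feature
top_k = np.argsort(scores)[::-1][:8]               # indices of selected features
```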
- A Configurable Library for Generating and Manipulating Maze Datasets (arXiv, 2023-09-19)
Mazes serve as an excellent testbed due to varied generation algorithms.
We present maze-dataset, a comprehensive library for generating, processing, and visualizing datasets consisting of maze-solving tasks.
- Explaining Classifiers Trained on Raw Hierarchical Multiple-Instance Data (arXiv, 2022-08-04)
A number of data sources have the natural form of structured data interchange formats (e.g. security logs in JSON/XML format).
Existing methods, such as Hierarchical Multiple-Instance Learning (HMIL), allow learning from such data in their raw form.
By treating these models as subset selection problems, we demonstrate how interpretable explanations with favourable properties can be generated using computationally efficient algorithms.
We compare to an explanation technique adopted from graph neural networks showing an order of magnitude speed-up and higher-quality explanations.
- Transfer Learning with Deep Tabular Models (arXiv, 2022-06-30)
We show that upstream data gives tabular neural networks a decisive advantage over GBDT models.
We propose a realistic medical diagnosis benchmark for tabular transfer learning.
We propose a pseudo-feature method for cases where the upstream and downstream feature sets differ.
- Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations (arXiv, 2022-02-22)
We develop a convenient gradient-based method for selecting the data augmentation.
We use a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective.
- Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies (arXiv, 2020-03-18)
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix.
On text classification, language modeling, and movie recommendation benchmarks, we show that ANT (Anchor & Transform) is particularly suitable for large vocabulary sizes.
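The mechanics can be sketched as follows: store a small dense anchor matrix plus a sparse mixing matrix, and reconstruct any token's embedding as a sparse combination of anchors. Shapes and the thresholding rule below are illustrative, not ANT's actual training procedure:

```python
# Sketch of the anchor-plus-sparse-transform idea: each vocabulary item's
# embedding is a sparse nonnegative mixture of a few anchor embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab, n_anchors, dim = 10_000, 64, 128
anchors = rng.normal(size=(n_anchors, dim))          # small dense anchor set
T = rng.random(size=(vocab, n_anchors))
T[T < 0.95] = 0.0                                    # sparsify: ~5% nonzeros

def embed(token_ids: np.ndarray) -> np.ndarray:
    """Reconstruct embeddings on the fly instead of storing vocab x dim."""
    return T[token_ids] @ anchors                    # (len(ids), dim)

ids = np.array([3, 17, 9_999])
vecs = embed(ids)
dense_params = vocab * dim                           # 1,280,000 floats if dense
ant_params = n_anchors * dim + int((T > 0).sum())    # far fewer parameters
```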
- Multi-layer Optimizations for End-to-End Data Analytics (arXiv, 2020-01-10)
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.