Kamae: Bridging Spark and Keras for Seamless ML Preprocessing
- URL: http://arxiv.org/abs/2507.06021v1
- Date: Tue, 08 Jul 2025 14:30:10 GMT
- Title: Kamae: Bridging Spark and Keras for Seamless ML Preprocessing
- Authors: George Barrowclough, Marian Andrecki, James Shinner, Daniele Donghi,
- Abstract summary: Kamae is a Python library that bridges the gap between training and inference preprocessing by translating PySpark preprocessing pipelines into equivalent Keras models. The framework is illustrated on real-world use cases, including the MovieLens dataset and Expedia's Learning-to-Rank pipelines.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In production recommender systems, feature preprocessing must be faithfully replicated across training and inference environments. This often requires duplicating logic between offline and online environments, increasing engineering effort and introducing risks of dataset shift. We present Kamae, an open-source Python library that bridges this gap by translating PySpark preprocessing pipelines into equivalent Keras models. Kamae provides a suite of configurable Spark transformers and estimators, each mapped to a corresponding Keras layer, enabling consistent, end-to-end preprocessing across the ML lifecycle. The framework's utility is illustrated on real-world use cases, including the MovieLens dataset and Expedia's Learning-to-Rank pipelines. The code is available at https://github.com/ExpediaGroup/kamae.
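The train/serve consistency problem described in the abstract can be sketched in plain Python. This is a minimal illustration only and does not use Kamae's API: the names `StandardScaleSpec`, `fit_offline`, `transform_offline`, and `transform_online` are hypothetical. The idea is that parameters fitted once offline are exported and reused unchanged online, so both paths apply the same transformation.

```python
import json
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class StandardScaleSpec:
    """Fitted parameters shared by the offline and online code paths (hypothetical)."""
    mean: float
    std: float

    def to_json(self) -> str:
        # Export the fitted parameters so the serving side can load them.
        return json.dumps({"mean": self.mean, "std": self.std})

    @staticmethod
    def from_json(payload: str) -> "StandardScaleSpec":
        data = json.loads(payload)
        return StandardScaleSpec(mean=data["mean"], std=data["std"])


def fit_offline(values: Sequence[float]) -> StandardScaleSpec:
    """Estimate mean and population standard deviation on the training column."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return StandardScaleSpec(mean=mean, std=math.sqrt(variance))


def transform_offline(values: Sequence[float], spec: StandardScaleSpec) -> List[float]:
    """Batch path, standing in for a Spark transformer."""
    return [(v - spec.mean) / spec.std for v in values]


def transform_online(value: float, spec: StandardScaleSpec) -> float:
    """Single-example path, standing in for a Keras preprocessing layer."""
    return (value - spec.mean) / spec.std


training_column = [1.0, 2.0, 3.0, 4.0]
spec = fit_offline(training_column)

# Serving loads the exported parameters instead of re-deriving them.
serving_spec = StandardScaleSpec.from_json(spec.to_json())

offline = transform_offline(training_column, spec)
online = [transform_online(v, serving_spec) for v in training_column]

# Both paths must agree, otherwise the model sees shifted features at inference.
assert all(math.isclose(a, b) for a, b in zip(offline, online))
```

According to the abstract, Kamae removes the need to hand-write the second code path by deriving the Keras model from the Spark pipeline itself; the sketch only shows the parity property that such a translation is meant to guarantee.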
Related papers
- PyPulse: A Python Library for Biosignal Imputation [58.35269251730328]
We introduce PyPulse, a Python package for imputation of biosignals in both clinical and wearable sensor settings. PyPulse provides a modular and extendable framework with high ease-of-use for a broad userbase, including non-machine-learning bioresearchers. We released PyPulse under the MIT License on GitHub and PyPI.
(2024-12-09T11:00:55Z)
- Cuvis.Ai: An Open-Source, Low-Code Software Ecosystem for Hyperspectral Processing and Classification [0.4038539043067986]
cuvis.ai is an open-source and low-code software ecosystem for data acquisition, preprocessing, and model training.
The package is written in Python and provides wrappers around common machine learning libraries.
(2024-11-18T06:33:40Z)
- KerasCV and KerasNLP: Vision and Language Power-Ups [9.395199188271254]
KerasCV and KerasNLP are extensions of the Keras API for Computer Vision and Natural Language Processing.
These domain packages are designed to enable fast experimentation, with a focus on ease-of-use and performance.
The libraries are fully open-source (Apache 2.0 license) and available on GitHub.
(2024-05-30T16:58:34Z)
- torchgfn: A PyTorch GFlowNet library [56.071033896777784]
torchgfn is a PyTorch library that aims to address this need.
It provides users with a simple API for environments and useful abstractions for samplers and losses.
(2023-05-24T00:20:59Z)
- DADApy: Distance-based Analysis of DAta-manifolds in Python [51.37841707191944]
DADApy is a Python software package for analysing and characterising high-dimensional data.
It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics.
(2022-05-04T08:41:59Z)
- PyHHMM: A Python Library for Heterogeneous Hidden Markov Models [63.01207205641885]
PyHHMM is an object-oriented Python implementation of Heterogeneous-Hidden Markov Models (HHMMs).
PyHHMM emphasizes features not supported in similar available frameworks: a heterogeneous observation model, missing data inference, different model order selection criteria, and semi-supervised training.
PyHHMM relies on the numpy, scipy, scikit-learn, and seaborn Python packages, and is distributed under the Apache-2.0 License.
(2022-01-12T07:32:36Z)
- PTRAIL -- A python package for parallel trajectory data preprocessing [2.348339658768759]
Trajectory data represent a trace of an object that changes its position in space over time.
There is a need for software specifically tailored to preprocess trajectory data.
We propose PTRAIL, a Python package offering several trajectory preprocessing steps.
(2021-08-26T20:14:07Z)
- SuperSuit: Simple Microwrappers for Reinforcement Learning Environments [0.0]
SuperSuit is a Python library that includes all popular wrappers, as well as wrappers that can easily apply functions to the observations, actions, and rewards.
It's compatible with the standard Gym environment specification, as well as the PettingZoo specification for multi-agent environments.
(2020-08-17T00:30:06Z)
- Picasso: A Sparse Learning Library for High Dimensional Data Analysis in R and Python [77.33905890197269]
We describe a new library which implements a unified pathwise coordinate optimization for a variety of sparse learning problems.
The library is coded in C++ and has user-friendly R and Python wrappers.
(2020-06-27T02:39:24Z)
- torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models [19.024035785367044]
We design and implement a ready-to-use library in PyTorch for performing micro-batch pipeline parallelism with checkpointing proposed by GPipe.
We show that each component is necessary to fully benefit from pipeline parallelism in such an environment, and demonstrate the efficiency of the library.
(2020-04-21T11:27:00Z)
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program written in IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
(2020-01-10T16:14:44Z)