You Do Not Need a Bigger Boat: Recommendations at Reasonable Scale in a
(Mostly) Serverless and Open Stack
- URL: http://arxiv.org/abs/2107.07346v1
- Date: Thu, 15 Jul 2021 14:00:29 GMT
- Title: You Do Not Need a Bigger Boat: Recommendations at Reasonable Scale in a
(Mostly) Serverless and Open Stack
- Authors: Jacopo Tagliabue
- Abstract summary: We argue that immature data pipelines are preventing a large portion of industry practitioners from leveraging the latest research on recommender systems.
We propose our template data stack for machine learning at "reasonable scale", and show how many challenges are solved by embracing a serverless paradigm.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We argue that immature data pipelines are preventing a large portion of
industry practitioners from leveraging the latest research on recommender
systems. We propose our template data stack for machine learning at "reasonable
scale", and show how many challenges are solved by embracing a serverless
paradigm. Leveraging our experience, we detail how modern open source can
provide a pipeline processing terabytes of data with limited infrastructure
work.
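The abstract's central idea, a pipeline built from stateless serverless functions rather than dedicated infrastructure, can be illustrated with a minimal sketch. All names here (the `handler` signature, the event shape) are hypothetical and not taken from the paper; this is just the pattern, not the authors' actual stack.

```python
# Illustrative sketch only: one recommendation-pipeline step written as a
# stateless, serverless-style handler. The event shape and function name
# are hypothetical, not from the paper.
from collections import Counter

def handler(event, context=None):
    """Aggregate raw interaction events into per-item counts.

    Stateless by design: all input arrives in `event` and all output is
    returned, so the function can run on a FaaS runtime and scale
    horizontally with no infrastructure work.
    """
    interactions = event.get("interactions", [])
    counts = Counter(i["item_id"] for i in interactions)
    # Downstream steps (e.g., training-set materialization) would read
    # this output from object storage rather than a shared server.
    return {"item_counts": dict(counts)}

# Local invocation of the pure function:
out = handler({"interactions": [
    {"item_id": "a"}, {"item_id": "a"}, {"item_id": "b"},
]})
```

Because the step holds no state of its own, the same function can be tested locally and deployed unchanged to a serverless runtime.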
Related papers
- Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration [54.8229698058649]
We study how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies.
Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits.
We empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks.
arXiv Detail & Related papers (2024-10-23T17:58:45Z) - Bauplan: zero-copy, scale-up FaaS for data pipelines [4.6797109107617105]
bauplan is a novel FaaS programming model and serverless runtime designed for data practitioners.
bauplan enables users to declaratively define functional Directed Acyclic Graphs (DAGs) along with their runtime environments.
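The idea of declaratively defining a functional DAG of data steps can be sketched generically. To be clear, this is NOT bauplan's actual API, just a hypothetical, minimal illustration of the pattern: each step is a pure function, and dependencies are declared by name.

```python
# Hypothetical sketch of a declarative functional DAG; not bauplan's API.
from typing import Callable, Dict, List, Tuple

class DAG:
    def __init__(self):
        # step name -> (function, names of upstream steps)
        self.steps: Dict[str, Tuple[Callable, List[str]]] = {}

    def step(self, *deps: str):
        """Decorator registering a function as a DAG node with dependencies."""
        def register(fn: Callable):
            self.steps[fn.__name__] = (fn, list(deps))
            return fn
        return register

    def run(self) -> Dict[str, object]:
        """Resolve every step once, in dependency order, via memoized recursion."""
        results: Dict[str, object] = {}
        def resolve(name: str):
            if name not in results:
                fn, deps = self.steps[name]
                results[name] = fn(*(resolve(d) for d in deps))
            return results[name]
        for name in self.steps:
            resolve(name)
        return results

dag = DAG()

@dag.step()
def raw():
    return [1, 2, 3]

@dag.step("raw")
def doubled(xs):
    return [x * 2 for x in xs]

results = dag.run()
```

In a real system each step would also declare its runtime environment and read from or write to object storage; the sketch keeps only the DAG-declaration idea.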
We show that bauplan delivers both better performance and a superior developer experience for data workloads by trading reduced generality for data-awareness.
arXiv Detail & Related papers (2024-10-22T22:49:01Z) - Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie [5.259526087073711]
We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie.
We demonstrate its ability to offer time-travel and branching semantics on top of object storage, and to reproduce a full pipeline with a few CLI commands.
arXiv Detail & Related papers (2024-04-21T14:53:33Z) - Solving Data Quality Problems with Desbordante: a Demo [35.75243108496634]
Desbordante is an open-source data profiler that aims to close the gap between research prototypes and industry-ready tools.
It is built with emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations.
In this demonstration, we show several scenarios that allow end users to solve different data quality problems.
arXiv Detail & Related papers (2023-07-27T15:26:26Z) - Efficient Online Reinforcement Learning with Offline Data [78.92501185886569]
We show that we can simply apply existing off-policy methods to leverage offline data when learning online.
We extensively ablate these design choices, demonstrating the key factors that most affect performance.
We see that correct application of these simple recommendations can provide a $\mathbf{2.5\times}$ improvement over existing approaches.
arXiv Detail & Related papers (2023-02-06T17:30:22Z) - Desbordante: from benchmarking suite to high-performance
science-intensive data profiler (preprint) [36.537985747809245]
Desbordante is a high-performance science-intensive data profiler with open source code.
Unlike similar systems, it is built with emphasis on industrial application in a multi-user environment.
It is efficient, resilient to crashes, and scalable.
arXiv Detail & Related papers (2023-01-14T19:14:51Z) - Pushing the Limits of Simple Pipelines for Few-Shot Learning: External
Data and Fine-Tuning Make a Difference [74.80730361332711]
Few-shot learning is an important and topical problem in computer vision.
We show that a simple transformer-based pipeline yields surprisingly good performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-15T02:55:58Z) - Kubric: A scalable dataset generator [73.78485189435729]
Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines.
We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
arXiv Detail & Related papers (2022-03-07T18:13:59Z) - Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning
Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the procedures and pipelines needed to actually deploy machine learning capabilities in real, production-grade systems. In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script-language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - From ImageNet to Image Classification: Contextualizing Progress on
Benchmarks [99.19183528305598]
We study how specific design choices in the ImageNet creation process impact the fidelity of the resulting dataset.
Our analysis pinpoints how a noisy data collection pipeline can lead to a systematic misalignment between the resulting benchmark and the real-world task it serves as a proxy for.
arXiv Detail & Related papers (2020-05-22T17:39:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.