You Do Not Need a Bigger Boat: Recommendations at Reasonable Scale in a
(Mostly) Serverless and Open Stack
- URL: http://arxiv.org/abs/2107.07346v1
- Date: Thu, 15 Jul 2021 14:00:29 GMT
- Authors: Jacopo Tagliabue
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We argue that immature data pipelines are preventing a large portion of
industry practitioners from leveraging the latest research on recommender
systems. We propose our template data stack for machine learning at "reasonable
scale", and show how many challenges are solved by embracing a serverless
paradigm. Leveraging our experience, we detail how modern open source can
provide a pipeline processing terabytes of data with limited infrastructure
work.
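As a concrete (hypothetical) illustration of the serverless paradigm the abstract advocates, a recommendation endpoint can reduce to a single stateless handler. The sketch below follows the common AWS Lambda handler convention; the `PRECOMPUTED_SCORES` table is an illustrative stand-in for a model artifact that would be loaded from object storage at cold start, not part of the paper's actual stack.

```python
# Hypothetical stand-in for precomputed user->item scores; in a real serverless
# deployment this would be loaded from object storage when the function cold-starts.
PRECOMPUTED_SCORES = {
    "user_1": {"item_a": 0.9, "item_b": 0.4, "item_c": 0.7},
}

def handler(event, context=None):
    """Return the top-k recommended items for the user in `event`.

    Follows the common serverless (AWS Lambda-style) handler signature:
    a JSON-like `event` dict in, a JSON-serializable dict out.
    """
    user_id = event["user_id"]
    k = event.get("k", 2)
    scores = PRECOMPUTED_SCORES.get(user_id, {})
    # Rank item ids by score, descending, and keep the top k.
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return {"user_id": user_id, "items": top_k}
```

Because the handler holds no state of its own, scaling is delegated entirely to the platform, which is the infrastructure-work saving the abstract refers to.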
Related papers
- Knowledge Adaptation from Large Language Model to Recommendation for Practical Industrial Application [54.984348122105516]
Large Language Models (LLMs) pretrained on massive text corpora present a promising avenue for enhancing recommender systems.
We propose an Llm-driven knowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world knowledge with collaborative knowledge.
arXiv Detail & Related papers (2024-05-07T04:00:30Z)
- Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie [5.259526087073711]
We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie.
We demonstrate its ability to offer time-travel and branching semantics on top of object storage, and to replay a full pipeline with a few CLI commands.
arXiv Detail & Related papers (2024-04-21T14:53:33Z)
- Solving Data Quality Problems with Desbordante: a Demo [35.75243108496634]
Desbordante is an open-source data profiler that aims to close this gap.
It is built with emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations.
In this demonstration, we show several scenarios that allow end users to solve different data quality problems.
arXiv Detail & Related papers (2023-07-27T15:26:26Z)
- Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint) [36.537985747809245]
Desbordante is a high-performance science-intensive data profiler with open source code.
Unlike similar systems, it is built with emphasis on industrial application in a multi-user environment.
It is efficient, resilient to crashes, and scalable.
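To make the profiling task concrete, the core of what a data profiler like Desbordante discovers includes functional dependencies (FDs). The brute-force check below is an illustrative sketch of the idea only, not Desbordante's actual API; the column names and the toy table are hypothetical.

```python
def holds_fd(rows, lhs, rhs):
    """Return True if column `lhs` functionally determines column `rhs`,
    i.e. every value of `lhs` maps to exactly one value of `rhs`."""
    seen = {}
    for row in rows:
        key, val = row[lhs], row[rhs]
        # setdefault stores the first rhs seen for this lhs value;
        # a later, different rhs for the same lhs violates the FD.
        if seen.setdefault(key, val) != val:
            return False
    return True

# Hypothetical toy table: zip code determines city here, so zip -> city holds.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "94105", "city": "San Francisco"},
]
```

A production profiler does this at scale with pruning and parallelism rather than a quadratic scan, which is where the efficiency and scalability claims above come in.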
arXiv Detail & Related papers (2023-01-14T19:14:51Z)
- Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference [74.80730361332711]
Few-shot learning is an important and topical problem in computer vision.
We show that a simple transformer-based pipeline yields surprisingly good performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-15T02:55:58Z)
- Kubric: A scalable dataset generator [73.78485189435729]
Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines.
We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
arXiv Detail & Related papers (2022-03-07T18:13:59Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
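One of the knobs such pipelines tune is overlapping per-sample preprocessing work. The sketch below is an illustration of that general idea, not the paper's system: a thread pool can raise throughput when the `decode` step is I/O-bound (threads do not help pure-Python CPU-bound work because of the GIL). The `decode` transform here is a hypothetical stand-in.

```python
from concurrent.futures import ThreadPoolExecutor

def decode(sample):
    # Hypothetical stand-in for an expensive decode/augment step
    # (e.g. reading and decoding an image from storage).
    return sample * 2

def preprocess_serial(samples):
    # Baseline: one sample at a time.
    return [decode(s) for s in samples]

def preprocess_parallel(samples, workers=4):
    # Overlap decode calls across a thread pool; map preserves input order,
    # so the output matches the serial baseline.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode, samples))
```

The paper's 3x to 13x gains come from systematically tuning choices like this (degree of parallelism, placement of each step) rather than from any single trick.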
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- WPPNets: Unsupervised CNN Training with Wasserstein Patch Priors for Image Superresolution [0.0]
WPPNets are CNNs trained by a new unsupervised loss function for image superresolution of materials microstructures.
We show that WPPNets are much more stable under inaccurate knowledge or perturbations of the forward operator.
arXiv Detail & Related papers (2022-01-20T13:04:19Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Basic cross-platform tensor frameworks and script language engines do not, by themselves, supply the procedures and pipelines needed for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all these requirements while still using such basic cross-platform tensor frameworks and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- From ImageNet to Image Classification: Contextualizing Progress on Benchmarks [99.19183528305598]
We study how specific design choices in the ImageNet creation process impact the fidelity of the resulting dataset.
Our analysis pinpoints how a noisy data collection pipeline can lead to a systematic misalignment between the resulting benchmark and the real-world task it serves as a proxy for.
arXiv Detail & Related papers (2020-05-22T17:39:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.