FeatureBox: Feature Engineering on GPUs for Massive-Scale Ads Systems
- URL: http://arxiv.org/abs/2210.07768v1
- Date: Mon, 26 Sep 2022 02:31:13 GMT
- Title: FeatureBox: Feature Engineering on GPUs for Massive-Scale Ads Systems
- Authors: Weijie Zhao, Xuewu Jiao, Xinsheng Luo, Jingxue Li, Belhal Karimi, Ping Li
- Abstract summary: We propose a novel end-to-end training framework that pipelines feature extraction and training on GPU servers, saving the intermediate I/O of the feature extraction.
We present a lightweight GPU memory management algorithm that supports dynamic GPU memory allocation with minimal overhead.
- Score: 15.622358361804343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning has been widely deployed for online ads systems to predict
Click-Through Rate (CTR). Machine learning researchers and practitioners
frequently retrain CTR models to test their newly extracted features. However,
the CTR model training often relies on a large number of raw input data logs.
Hence, the feature extraction can take a significant proportion of the training
time for an industrial-level CTR model. In this paper, we propose FeatureBox, a
novel end-to-end training framework that pipelines the feature extraction and
the training on GPU servers to save the intermediate I/O of the feature
extraction. We rewrite computation-intensive feature extraction operators as
GPU operators and leave the memory-intensive operators on CPUs. We introduce a
layer-wise operator scheduling algorithm to schedule these heterogeneous
operators. We present a lightweight GPU memory management algorithm that
supports dynamic GPU memory allocation with minimal overhead. We experimentally
evaluate FeatureBox and compare it with the previous in-production feature
extraction framework on two real-world ads applications. The results confirm
the effectiveness of our proposed method.
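FeatureBox's actual pipeline, operator scheduler, and memory allocator are only sketched in this abstract. Purely as a hypothetical illustration of the headline idea (overlapping feature extraction with GPU training through a bounded queue, so extracted mini-batches never touch disk), a minimal Python sketch might look like the following; `extract_features`, the toy model, and all parameters are assumptions, not FeatureBox's implementation.

```python
import queue
import threading

import torch

def extract_features(raw_log_batch):
    """Hypothetical feature extraction from raw ad logs (parsing, hashing, joins).

    In FeatureBox, compute-heavy extraction ops run on the GPU and
    memory-heavy ops stay on the CPU; this toy version is CPU-only.
    """
    dense = torch.randn(len(raw_log_batch), 64)          # stand-in features
    labels = torch.randint(0, 2, (len(raw_log_batch),))  # stand-in labels
    return dense, labels

def producer(raw_log_batches, batch_queue):
    """Extract features and hand mini-batches straight to the trainer,
    skipping the intermediate write/read of extracted features."""
    for raw in raw_log_batches:
        batch_queue.put(extract_features(raw))
    batch_queue.put(None)  # sentinel: no more batches

def trainer(batch_queue, model, optimizer, device):
    """Consume extracted batches and run GPU training steps."""
    loss_fn = torch.nn.BCEWithLogitsLoss()
    while (batch := batch_queue.get()) is not None:
        dense, labels = batch
        dense, labels = dense.to(device), labels.float().to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(dense).squeeze(-1), labels)
        loss.backward()
        optimizer.step()

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(64, 1).to(device)               # toy CTR model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
raw_log_batches = [list(range(256)) for _ in range(10)]  # toy raw logs

q = queue.Queue(maxsize=4)  # bounded queue pipelines the two stages
t = threading.Thread(target=producer, args=(raw_log_batches, q))
t.start()
trainer(q, model, optimizer, device)
t.join()
```

The bounded queue keeps the extractor at most a few batches ahead of the trainer; that back-pressure is what makes the two stages run as a pipeline rather than in sequence.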
Related papers
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for approximating matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
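As a rough illustration of the estimator family in the WTA-CRS entry above: classical column-row sampling (CRS) already gives an unbiased estimate of a matrix product, and WTA-CRS is presented as a lower-variance member of this family. The NumPy sketch below shows only the classical CRS baseline, not the winner-take-all variant.

```python
import numpy as np

def sampled_matmul(A, B, k, rng=None):
    """Unbiased column-row sampling (CRS) estimate of A @ B.

    A @ B equals the sum of n outer products outer(A[:, i], B[i, :]);
    sampling k of them with probability p_i and reweighting each by
    1 / (k * p_i) keeps the estimate unbiased.
    """
    rng = rng or np.random.default_rng(0)
    n = A.shape[1]
    # Probabilities proportional to ||A[:, i]|| * ||B[i, :]|| minimize the
    # variance of plain with-replacement CRS; WTA-CRS refines this further.
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p /= p.sum()
    idx = rng.choice(n, size=k, p=p)        # k sampled column-row pairs
    scale = 1.0 / (k * p[idx])              # reweighting for unbiasedness
    return (A[:, idx] * scale) @ B[idx, :]

rng = np.random.default_rng(1)
A, B = rng.standard_normal((32, 512)), rng.standard_normal((512, 16))
approx = sampled_matmul(A, B, k=128)
exact = A @ B
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))  # relative error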
- RAF: Holistic Compilation for Deep Learning Model Training [17.956035630476173]
In this paper, we present RAF, a deep learning compiler for training.
Unlike existing DLCs, RAF accepts a forward model and generates the training graph in-house.
RAF is able to systematically consolidate graph optimizations for performance, memory and distributed training.
arXiv Detail & Related papers (2023-03-08T17:51:13Z)
- Meta-Wrapper: Differentiable Wrapping Operator for User Interest Selection in CTR Prediction [97.99938802797377]
Click-through rate (CTR) prediction, whose goal is to predict the probability that a user will click on an item, has become increasingly significant in recommender systems.
Recent deep learning models that automatically extract user interest from user behaviors have achieved great success.
We propose a novel approach under the framework of the wrapper method, named Meta-Wrapper.
arXiv Detail & Related papers (2022-06-28T03:28:15Z)
- ETAD: A Unified Framework for Efficient Temporal Action Detection [70.21104995731085]
Untrimmed video understanding tasks such as temporal action detection (TAD) often suffer from a huge demand for computing resources.
We build a unified framework for efficient end-to-end temporal action detection (ETAD).
ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2022-05-14T21:16:21Z)
- Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in the existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z)
- Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how the found summaries help steer this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
- ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [23.264897780201316]
Various deep Click-Through Rate (CTR) models are deployed in commercial systems by industrial companies.
To achieve better performance, it is necessary to train deep CTR models on huge volumes of training data efficiently.
We propose ScaleFreeCTR, a MixCache-based distributed training system for CTR models.
arXiv Detail & Related papers (2021-04-17T13:36:19Z)
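MixCache itself is not described in this summary; purely as an assumed illustration of the general idea behind such systems (keep the huge embedding table in host memory and stage the hot rows for upcoming batches into GPU memory), a toy sketch might look like this, with every name hypothetical:

```python
import torch

class CachedEmbedding:
    """Toy host-memory embedding table with a device-side cache for hot rows.

    Hypothetical illustration only: the full table lives in large, cheap
    host memory; rows needed by upcoming batches are staged into small,
    fast device memory before the training step, then written back.
    """

    def __init__(self, num_rows, dim, device):
        self.host_table = torch.zeros(num_rows, dim)  # full table on host
        self.device = device
        self.gpu_rows = {}  # row id -> cached row on the device

    def prefetch(self, ids):
        """Stage the rows an upcoming batch will touch into device memory."""
        for i in set(ids.tolist()):
            if i not in self.gpu_rows:
                self.gpu_rows[i] = self.host_table[i].to(self.device)

    def lookup(self, ids):
        """Gather a batch of embeddings from the device-side cache."""
        return torch.stack([self.gpu_rows[i] for i in ids.tolist()])

    def evict(self):
        """Write cached rows back to the host table and free device memory."""
        for i, row in self.gpu_rows.items():
            self.host_table[i] = row.cpu()
        self.gpu_rows.clear()

device = "cuda" if torch.cuda.is_available() else "cpu"
emb = CachedEmbedding(num_rows=1_000_000, dim=16, device=device)
batch_ids = torch.randint(0, 1_000_000, (256,))
emb.prefetch(batch_ids)
vectors = emb.lookup(batch_ids)  # shape (256, 16), resident on the device
emb.evict()
```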
- Improving Computational Efficiency in Visual Reinforcement Learning via Stored Embeddings [89.63764845984076]
We present Stored Embeddings for Efficient Reinforcement Learning (SEER), a simple modification of existing off-policy deep reinforcement learning methods.
We show that SEER does not degrade the performance of RL agents while significantly saving computation and memory.
arXiv Detail & Related papers (2021-03-04T08:14:10Z)
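Per its title, SEER's saving comes from freezing the observation encoder early in training and storing its low-dimensional embeddings in the replay buffer instead of raw observations. A toy sketch under that reading, with all names and the freeze point hypothetical:

```python
import torch

# Stand-in pixel encoder; in SEER this is the agent's convolutional encoder.
encoder = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, stride=2), torch.nn.ReLU(),
    torch.nn.Flatten(), torch.nn.LazyLinear(50),
)

FREEZE_STEP = 1_000   # hypothetical step at which the encoder is frozen
replay_buffer = []

def store_transition(obs, action, reward, step):
    """Store raw pixels early on; after the freeze, store latents instead."""
    if step < FREEZE_STEP:
        replay_buffer.append((obs, action, reward))        # raw frame: large
    else:
        with torch.no_grad():                              # frozen encoder
            z = encoder(obs.unsqueeze(0)).squeeze(0)
        replay_buffer.append((z, action, reward))          # latent: small

# After FREEZE_STEP, the layers above the encoder train on stored latents,
# skipping the encoder's forward and backward passes entirely.
obs = torch.rand(3, 84, 84)
store_transition(obs, action=1, reward=0.0, step=2_000)
print(replay_buffer[-1][0].shape)  # torch.Size([50]) instead of (3, 84, 84)
```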
- ConfuciuX: Autonomous Hardware Resource Assignment for DNN Accelerators using Reinforcement Learning [5.251940442946459]
We propose an autonomous strategy called ConfuciuX to find optimized HW resource assignments for a given model and dataflow style.
It converges to the optimized hardware configuration 4.7 to 24 times faster than alternative techniques.
arXiv Detail & Related papers (2020-09-04T04:59:26Z)
- On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points.
We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv Detail & Related papers (2020-02-15T23:25:12Z)
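As an assumed illustration of the workflow the coreset entry above describes (construct a small weighted subset, then hand it to any off-the-shelf SVM solver): the sketch below uses uniform sampling as a placeholder for the paper's sensitivity-based importance sampling, so it carries none of the paper's guarantees.

```python
import numpy as np
from sklearn.svm import LinearSVC

def uniform_coreset(X, y, m, rng=None):
    """Return m weighted points standing in for the full dataset.

    Placeholder only: the paper derives importance-sampling probabilities
    (sensitivities) with provable guarantees; uniform sampling has none.
    """
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(X), size=m, replace=False)
    weights = np.full(m, len(X) / m)  # reweight so totals match the full set
    return X[idx], y[idx], weights

rng = np.random.default_rng(1)
X = rng.standard_normal((100_000, 20))
y = (X[:, 0] + 0.1 * rng.standard_normal(100_000) > 0).astype(int)

Xc, yc, w = uniform_coreset(X, y, m=2_000)
clf = LinearSVC().fit(Xc, yc, sample_weight=w)  # any off-the-shelf solver
print(clf.score(X, y))  # accuracy evaluated on the full dataset
```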
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.