PHN: Parallel heterogeneous network with soft gating for CTR prediction
- URL: http://arxiv.org/abs/2206.09184v1
- Date: Sat, 18 Jun 2022 11:37:53 GMT
- Title: PHN: Parallel heterogeneous network with soft gating for CTR prediction
- Authors: Ri Su, Alphonse Houssou Hounye, Cong Cao, Muzhou Hou
- Abstract summary: This paper proposes a Parallel Heterogeneous Network (PHN) model, which constructs a network with a parallel structure.
Residual links with trainable parameters are used in the network to mitigate the influence of the weak gradient phenomenon.
- Score: 2.9722444664527243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Click-through Rate (CTR) prediction is a fundamental task in recommender
systems. Most previous CTR models are built on the Wide & Deep structure and have
gradually evolved into parallel structures composed of different modules. However,
simply accumulating parallel structures leads to higher structural complexity and
longer training time. Moreover, because the output layer uses a Sigmoid activation,
the linearly added activation values of the parallel structures easily push samples
into the weak-gradient interval during training, producing the weak gradient
phenomenon and reducing training effectiveness. To this end, this paper proposes a
Parallel Heterogeneous Network (PHN) model, which constructs a parallel-structure
network from three different interaction analysis methods and uses Soft Selection
Gating (SSG) to weight the heterogeneous features produced by the differently
structured branches. Finally, residual links with trainable parameters are used in
the network to mitigate the influence of the weak gradient phenomenon. Furthermore,
we demonstrate the effectiveness of PHN in a large number of comparative experiments
and visualize the behavior of the model during training and across its structure.
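The abstract names the ingredients (three parallel interaction branches, Soft Selection Gating over their heterogeneous outputs, and residual links with trainable parameters feeding a Sigmoid output) but not their exact form. The PyTorch sketch below is one minimal reading of that description, not the paper's implementation: the branch modules, layer widths, and gate shape are placeholder assumptions, and the names `PHNSketch`, `SoftSelectionGate`, and `res_scale` are invented for illustration. The trainable residual scale is meant to keep a gradient path open when the Sigmoid output saturates, i.e. when the derivative σ'(z) = σ(z)(1 − σ(z)) is near zero (the "weak gradient" interval).

```python
import torch
import torch.nn as nn


class SoftSelectionGate(nn.Module):
    """Illustrative SSG sketch: softly weights heterogeneous branch outputs."""

    def __init__(self, dim: int, num_branches: int):
        super().__init__()
        # One gating score per branch, computed from the concatenated branch outputs.
        self.gate = nn.Linear(dim * num_branches, num_branches)

    def forward(self, branch_outputs):              # list of [batch, dim] tensors
        stacked = torch.stack(branch_outputs, dim=1)             # [batch, branches, dim]
        scores = self.gate(torch.cat(branch_outputs, dim=-1))    # [batch, branches]
        weights = torch.softmax(scores, dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)      # [batch, dim]


class PHNSketch(nn.Module):
    """Parallel branches + SSG + trainable-scale residual, as described in the abstract.

    The three branches are generic MLPs here; the paper uses three different
    interaction analysis methods whose details are not given in the abstract.
    """

    def __init__(self, dim: int = 64):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(3)]
        )
        self.ssg = SoftSelectionGate(dim, num_branches=3)
        # Trainable residual scale (assumed form of the "residual link with
        # trainable parameters"), intended to mitigate weak gradients.
        self.res_scale = nn.Parameter(torch.ones(1))
        self.out = nn.Linear(dim, 1)

    def forward(self, x):                            # x: [batch, dim] dense features
        fused = self.ssg([branch(x) for branch in self.branches])
        h = fused + self.res_scale * x               # residual link with trainable parameter
        return torch.sigmoid(self.out(h))            # predicted CTR probability


if __name__ == "__main__":
    model = PHNSketch(dim=64)
    print(model(torch.randn(8, 64)).shape)           # torch.Size([8, 1])
```

In this reading the gate produces a softmax weight per branch so the fusion stays differentiable, which is what "soft" selection suggests; the actual PHN gating and branch definitions may differ.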
Related papers
- Ray-Tracing for Conditionally Activated Neural Networks [4.9844734080376725]
We introduce a novel architecture for conditionally activated neural networks with a sampling mechanism that converges to an optimized configuration of expert activation.
Experimental results demonstrate that this approach achieves competitive accuracy compared to conventional baselines.
arXiv Detail & Related papers (2025-02-20T18:09:03Z)
- Orthogonal Stochastic Configuration Networks with Adaptive Construction Parameter for Data Analytics [6.940097162264939]
The randomness of SCNs makes them more likely to generate approximately linearly correlated nodes that are redundant and of low quality.
A fundamental principle in machine learning holds that a model with fewer parameters generalizes better.
This paper proposes orthogonal SCN, termed OSCN, to filtrate out the low-quality hidden nodes for network structure reduction.
arXiv Detail & Related papers (2022-05-26T07:07:26Z)
- Accumulated Decoupled Learning: Mitigating Gradient Staleness in Inter-Layer Model Parallelization [16.02377434191239]
We propose an accumulated decoupled learning (ADL) which incorporates the gradient accumulation technique to mitigate the stale gradient effect.
We prove that the proposed method can converge to critical points, i.e., the gradients converge to 0, in spite of its asynchronous nature.
The ADL is shown to outperform several state-of-the-art methods in classification tasks and is the fastest among the compared methods (a generic gradient-accumulation sketch appears after this list).
arXiv Detail & Related papers (2020-12-03T11:52:55Z)
- DAIS: Automatic Channel Pruning via Differentiable Annealing Indicator Search [55.164053971213576]
Convolutional neural networks have achieved great success in computer vision tasks, albeit with large computation overhead.
Structured (channel) pruning is usually applied to reduce the model redundancy while preserving the network structure.
Existing structured pruning methods require hand-crafted rules which may lead to tremendous pruning space.
arXiv Detail & Related papers (2020-11-04T07:43:01Z)
- ACDC: Weight Sharing in Atom-Coefficient Decomposed Convolution [57.635467829558664]
We introduce a structural regularization across convolutional kernels in a CNN.
We show that CNNs maintain performance with a dramatic reduction in parameters and computations.
arXiv Detail & Related papers (2020-09-04T20:41:47Z)
- TSAM: Temporal Link Prediction in Directed Networks based on Self-Attention Mechanism [2.5144068869465994]
We propose a deep learning model based on graph convolutional networks (GCN) and a self-attention mechanism, namely TSAM.
We run comparative experiments on four realistic networks to validate the effectiveness of TSAM.
arXiv Detail & Related papers (2020-08-23T11:56:40Z)
- DessiLBI: Exploring Structural Sparsity of Deep Networks via Differential Inclusion Paths [45.947140164621096]
We propose a new approach based on differential inclusions of inverse scale spaces.
We show that DessiLBI unveils "winning tickets" in early epochs.
arXiv Detail & Related papers (2020-07-04T04:40:16Z)
- The Heterogeneity Hypothesis: Finding Layer-Wise Differentiated Network Architectures [179.66117325866585]
We investigate a design space that is usually overlooked, i.e. adjusting the channel configurations of predefined networks.
We find that this adjustment can be achieved by shrinking widened baseline networks and leads to superior performance.
Experiments are conducted on various networks and datasets for image classification, visual tracking and image restoration.
arXiv Detail & Related papers (2020-06-29T17:59:26Z)
- An Ode to an ODE [78.97367880223254]
We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the group O(d).
This nested system of two flows provides stability and effectiveness of training and provably solves the gradient vanishing-explosion problem.
arXiv Detail & Related papers (2020-06-19T22:05:19Z)
- Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
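For the Accumulated Decoupled Learning entry above, the following is a minimal, generic gradient-accumulation loop in PyTorch. It only illustrates the accumulation idea that ADL builds on; it does not decouple layers or model the asynchronous, stale-gradient setting that ADL actually addresses, and the model, batch sizes, and `accum_steps` value are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Minimal gradient-accumulation loop (illustrative only, not the ADL algorithm).
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()
accum_steps = 4  # gradients from 4 micro-batches are summed before one update

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(32, 16)                       # synthetic micro-batch
    y = torch.randint(0, 2, (32, 1)).float()
    loss = criterion(model(x), y) / accum_steps   # scale so the sum matches a full batch
    loss.backward()                               # gradients accumulate in .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```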
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.