On Negative Transfer and Structure of Latent Functions in Multi-output
Gaussian Processes
- URL: http://arxiv.org/abs/2004.02382v1
- Date: Mon, 6 Apr 2020 02:47:30 GMT
- Title: On Negative Transfer and Structure of Latent Functions in Multi-output
Gaussian Processes
- Authors: Moyan Li, Raed Kontar
- Abstract summary: In this article, we first define negative transfer in the context of an $\mathcal{MGP}$ and then derive necessary conditions for an $\mathcal{MGP}$ model to avoid negative transfer.
We show that avoiding negative transfer is mainly dependent on having a sufficient number of latent functions $Q$.
We propose two latent structures that scale to arbitrarily large datasets, can avoid negative transfer and allow any kernel or sparse approximations to be used within.
- Score: 2.538209532048867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The multi-output Gaussian process ($\mathcal{MGP}$) is based on the
assumption that outputs share commonalities; however, if this assumption does
not hold, negative transfer leads to decreased performance relative to
learning outputs independently or in subsets. In this article, we first define
negative transfer in the context of an $\mathcal{MGP}$ and then derive
necessary conditions for an $\mathcal{MGP}$ model to avoid negative transfer.
Specifically, under the convolution construction, we show that avoiding
negative transfer is mainly dependent on having a sufficient number of latent
functions $Q$ regardless of the flexibility of the kernel or inference
procedure used. However, a slight increase in $Q$ leads to a large increase in
the number of parameters to be estimated. To this end, we propose two latent
structures that scale to arbitrarily large datasets, can avoid negative
transfer and allow any kernel or sparse approximations to be used within. These
structures also allow regularization, which can provide consistent and automatic
selection of related outputs.
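As a concrete illustration of the construction discussed above, the following minimal sketch builds the covariance of a two-output GP that shares $Q$ latent functions in a linear-model-of-coregionalization style (a special case of the convolution construction with Dirac smoothing kernels). The RBF kernels, mixing weights and lengthscales are illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(X1, X2, lengthscale):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def mgp_covariance(X, A, lengthscales):
    """LMC-style MGP covariance: Cov[y_i(x), y_j(x')] = sum_q A[i,q] A[j,q] k_q(x, x')."""
    P, Q = A.shape
    n = len(X)
    K = np.zeros((P * n, P * n))
    for q in range(Q):
        K += np.kron(np.outer(A[:, q], A[:, q]), rbf(X, X, lengthscales[q]))
    return K

# Two outputs sharing Q = 2 latent functions; the paper's point is that Q must be
# large enough, relative to the output structure, to avoid negative transfer.
X = np.linspace(0.0, 1.0, 25)
A = np.array([[1.0, 0.3],
              [0.2, 1.0]])           # illustrative mixing weights
K = mgp_covariance(X, A, lengthscales=[0.2, 0.5])
sample = rng.multivariate_normal(np.zeros(len(K)), K + 1e-8 * np.eye(len(K)))
y1, y2 = sample[:25], sample[25:]    # correlated prior draws for the two outputs
```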
Related papers
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z)
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
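A toy numerical illustration of the token-selection intuition above (not the paper's formal equivalence): as the norm of the attention parameters grows, softmax attention concentrates on the highest-scoring token, mimicking hard selection.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # 6 tokens, embedding dim 4
W = rng.normal(size=(4, 4))          # illustrative attention parameters
q = X[0]                             # query taken from the first token

scores = X @ W @ q                   # one row of the attention score matrix
for scale in (1.0, 5.0, 25.0):       # growing parameter norm
    print(scale, np.round(softmax(scale * scores), 3))
# The attention weights collapse onto the argmax token as the scale grows,
# mirroring the hard selection of "optimal tokens" described above.
```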
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models [57.933500846742234]
Recent work recognizes that structured outliers are the critical bottleneck for quantization performance.
We propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping.
This framework effectively suppresses the outliers and can be used in a plug-and-play mode.
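The snippet below sketches the general idea of per-token clipping before uniform quantization; it is an illustrative baseline, not the paper's Gamma Migration or its Token-Wise Clipping procedure.

```python
import numpy as np

def tokenwise_clip_quantize(acts, num_bits=8, pct=99.9):
    """Clip each token's activations to a per-token percentile range, then
    uniformly quantize; per-token clipping keeps a few structured outliers
    from blowing up the shared quantization step size."""
    lo = np.percentile(acts, 100.0 - pct, axis=1, keepdims=True)
    hi = np.percentile(acts, pct, axis=1, keepdims=True)
    clipped = np.clip(acts, lo, hi)
    scale = (hi - lo) / (2 ** num_bits - 1)
    q = np.round((clipped - lo) / scale)
    return q * scale + lo            # dequantized values, for inspection

rng = np.random.default_rng(0)
acts = rng.normal(size=(16, 768))    # (tokens, hidden dim)
acts[3, 10] = 60.0                   # a structured outlier in one token
print(np.abs(tokenwise_clip_quantize(acts) - acts).mean())
```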
arXiv Detail & Related papers (2022-09-27T12:05:59Z)
- Learning a Single Neuron for Non-monotonic Activation Functions [3.890410443467757]
Non-monotonic activation functions outperform the traditional monotonic ones in many applications.
We show that mild conditions on $\sigma$ are sufficient to guarantee learnability with polynomial sample and time complexity.
We also discuss how our positive results are related to existing negative results on training two-layer neural networks.
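A minimal sketch of the single-neuron setting with a non-monotonic activation (SiLU chosen for illustration), trained by plain gradient descent on a realizable target; it only illustrates the problem setup, not the paper's analysis.

```python
import numpy as np

def silu(z):                          # a common non-monotonic activation
    return z / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d, n = 5, 2000
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = silu(X @ w_true)                  # realizable single-neuron target

w, lr = np.zeros(d), 0.05
for _ in range(2000):                 # plain gradient descent on squared loss
    z = X @ w
    s = 1.0 / (1.0 + np.exp(-z))
    dsilu = s + z * s * (1.0 - s)     # derivative of silu
    w -= lr * (X.T @ ((silu(z) - y) * dsilu)) / n

print(np.linalg.norm(w - w_true))     # typically small in this benign setting
```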
arXiv Detail & Related papers (2022-02-16T13:44:25Z)
- Deterministic transformations between unitary operations: Exponential advantage with adaptive quantum circuits and the power of indefinite causality [0.0]
We show that when $f$ is an anti-homomorphism, sequential circuits could exponentially outperform parallel ones.
We present explicit constructions on how to obtain such advantages for the unitary inversion task $f(U)=U^{-1}$ and the unitary transposition task $f(U)=U^T$.
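For concreteness, the two target transformations can be spelled out numerically; the sketch below only defines $f(U)=U^{-1}$ and $f(U)=U^T$ for a random unitary and does not implement the paper's quantum circuit constructions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unitary(d):
    """Random unitary via QR of a complex Gaussian matrix (phase-corrected)."""
    Z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(Z)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

U = random_unitary(2)
U_inverse = U.conj().T                # inversion target:     f(U) = U^{-1} = U^dagger
U_transpose = U.T                     # transposition target: f(U) = U^T
print(np.allclose(U @ U_inverse, np.eye(2)))   # True
```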
arXiv Detail & Related papers (2021-09-16T20:10:05Z)
- Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance [14.06947898164194]
Heavy tails emerge in stochastic gradient descent (SGD) in various scenarios.
We provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance.
Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum.
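A small simulation of the regime described above, assuming symmetric Pareto gradient noise (infinite variance for tail index below 2) and a decaying step size; it illustrates the phenomenon rather than the paper's exact noise model.

```python
import numpy as np

rng = np.random.default_rng(2)

def heavy_tailed_noise(alpha=1.5):
    """Symmetric Pareto-type noise: infinite variance whenever alpha < 2."""
    return rng.choice([-1.0, 1.0]) * (rng.pareto(alpha) + 1.0)

# Minimize f(x) = 0.5 * x^2; the stochastic gradient is x plus heavy-tailed noise.
x = 10.0
for t in range(1, 100001):
    lr = 1.0 / t ** 0.9               # decaying step size
    x -= lr * (x + 0.1 * heavy_tailed_noise())

print(x)  # typically close to the optimum 0 despite infinite-variance noise
```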
arXiv Detail & Related papers (2021-02-20T13:45:11Z)
- Exponentially Weighted l_2 Regularization Strategy in Constructing Reinforced Second-order Fuzzy Rule-based Model [72.57056258027336]
In the conventional Takagi-Sugeno-Kang (TSK)-type fuzzy models, constant or linear functions are usually utilized as the consequent parts of the fuzzy rules.
We introduce an exponential weight approach inspired by the weight function theory encountered in harmonic analysis.
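As a simplified illustration of an exponentially weighted l_2 penalty on a linear consequent, the sketch below solves a ridge-style problem in closed form; the weighting scheme and polynomial features are assumptions for illustration, not the paper's construction.

```python
import numpy as np

def exp_weighted_ridge(X, y, base_lambda=0.1, growth=1.5):
    """Least squares with an exponentially weighted l_2 penalty:
    minimize ||Xw - y||^2 + sum_j base_lambda * growth**j * w_j**2,
    solved in closed form as w = (X^T X + diag(lambda_j))^{-1} X^T y."""
    d = X.shape[1]
    lambdas = base_lambda * growth ** np.arange(d)
    return np.linalg.solve(X.T @ X + np.diag(lambdas), X.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
X = np.vander(x, 6, increasing=True)           # polynomial features, a stand-in
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=50)  # for a rule's linear consequent
print(exp_weighted_ridge(X, y).round(3))       # higher-order terms shrink harder
```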
arXiv Detail & Related papers (2020-07-02T15:42:15Z)
- Linear Time Sinkhorn Divergences using Positive Features [51.50788603386766]
Solving optimal transport with an entropic regularization requires computing an $n\times n$ kernel matrix that is repeatedly applied to a vector.
We propose to use instead ground costs of the form $c(x,y)=-\log\langle\varphi(x),\varphi(y)\rangle$ where $\varphi$ is a map from the ground space onto the positive orthant $\mathbb{R}^r_+$, with $r\ll n$.
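A minimal sketch of the resulting linear-time Sinkhorn iteration with $\epsilon=1$: because the Gibbs kernel factors as $\Phi_x\Phi_y^\top$, each matrix-vector product costs $O((n+m)r)$ instead of $O(nm)$. The particular positive feature map used here is an illustrative choice, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def positive_features(X, W):
    """Positive feature map phi(x) = exp(Wx - |x|^2/2)/sqrt(r), so every entry of
    phi is positive and c(x, y) = -log <phi(x), phi(y)> is well defined."""
    r = W.shape[1]
    return np.exp(X @ W - 0.5 * (X ** 2).sum(1, keepdims=True)) / np.sqrt(r)

n, m, d, r = 500, 400, 3, 64
X, Y = rng.normal(size=(n, d)), rng.normal(size=(m, d))
W = rng.normal(size=(d, r))
Px, Py = positive_features(X, W), positive_features(Y, W)

# With epsilon = 1 the Gibbs kernel is K = exp(-c) = Px @ Py.T, which has rank r,
# so each Sinkhorn matrix-vector product costs O((n + m) r) instead of O(n m).
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
u, v = np.ones(n), np.ones(m)
for _ in range(200):
    u = a / (Px @ (Py.T @ v))
    v = b / (Py @ (Px.T @ u))
print(u @ (Px @ (Py.T @ v)))   # total mass of the coupling, ~1 at convergence
```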
arXiv Detail & Related papers (2020-06-12T10:21:40Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
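A sliding-window pattern is one simple example of an attention layer with $O(n)$ connections; the sketch below is illustrative and is not the specific sparsity pattern analyzed in the paper.

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=4):
    """Each query attends only to its `window` nearest keys on either side,
    giving O(n * window) connections per layer instead of the dense O(n^2)."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        out[i] = (w / w.sum()) @ V[lo:hi]
    return out

rng = np.random.default_rng(0)
n, d = 32, 8
Q, K, V = rng.normal(size=(3, n, d))
print(sliding_window_attention(Q, K, V).shape)   # (32, 8)
```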
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
- Scalable Variational Gaussian Process Regression Networks [19.699020509495437]
We propose a scalable variational inference algorithm for GPRN.
We tensorize the output space and introduce tensor/matrix-normal variational posteriors to capture the posterior correlations.
We demonstrate the advantages of our method in several real-world applications.
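For context, the sketch below samples from the GPRN prior, $y(x) = W(x)f(x) + \text{noise}$, with every entry of $W$ and $f$ drawn from its own GP; it shows the model structure only and does not implement the paper's tensorized variational inference.

```python
import numpy as np

rng = np.random.default_rng(0)

def gp_sample(X, lengthscale):
    """One draw from a zero-mean GP with an RBF kernel at inputs X."""
    d = X[:, None] - X[None, :]
    K = np.exp(-0.5 * (d / lengthscale) ** 2) + 1e-8 * np.eye(len(X))
    return rng.multivariate_normal(np.zeros(len(X)), K)

# GPRN generative model: y(x) = W(x) f(x) + noise, where every entry of the
# mixing matrix W(x) and of the latent vector f(x) has its own GP prior.
X = np.linspace(0.0, 1.0, 60)
P, Q = 3, 2                                     # outputs, latent functions
F = np.stack([gp_sample(X, 0.2) for _ in range(Q)])                       # (Q, n)
W = np.stack([[gp_sample(X, 0.5) for _ in range(Q)] for _ in range(P)])   # (P, Q, n)
Y = np.einsum('pqn,qn->pn', W, F) + 0.05 * rng.normal(size=(P, len(X)))
```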
arXiv Detail & Related papers (2020-03-25T16:39:47Z)
- New Bounds For Distributed Mean Estimation and Variance Reduction [25.815612182815702]
We consider the problem of distributed mean estimation (DME) in which $n$ machines are each given a local $d$-dimensional vector $x_v \in \mathbb{R}^d$.
We show that our method yields practical improvements for common applications, relative to prior approaches.
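A minimal DME sketch, assuming a simple unbiased stochastic quantizer as the compression step (an illustrative baseline rather than the paper's scheme).

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_quantize(x, levels=16):
    """Unbiased per-coordinate stochastic quantization onto a uniform grid."""
    vmin, vmax = x.min(), x.max()
    step = (vmax - vmin) / (levels - 1)
    t = (x - vmin) / step
    low = np.floor(t)
    q = low + (rng.random(x.shape) < (t - low))   # round up with prob. frac(t)
    return vmin + q * step

# n machines each hold a local vector x_v in R^d; each sends a quantized copy
# to a coordinator, which averages them to estimate the true mean.
n, d = 50, 200
X = rng.normal(size=(n, d))
estimate = np.mean([stochastic_quantize(x) for x in X], axis=0)
print(np.linalg.norm(estimate - X.mean(axis=0)) / np.sqrt(d))   # per-coordinate error
```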
arXiv Detail & Related papers (2020-02-21T13:27:13Z)