Understanding Expert Structures on Minimax Parameter Estimation in Contaminated Mixture of Experts
- URL: http://arxiv.org/abs/2410.12258v2
- Date: Thu, 06 Mar 2025 02:46:29 GMT
- Title: Understanding Expert Structures on Minimax Parameter Estimation in Contaminated Mixture of Experts
- Authors: Fanqi Yan, Huy Nguyen, Dung Le, Pedram Akbarian, Nhat Ho
- Abstract summary: We conduct a convergence analysis of parameter estimation in the contaminated mixture of experts. This model is motivated by the prompt learning problem, where one utilizes prompts, which can be formulated as experts, to fine-tune a large-scale pre-trained model for downstream tasks.
- Score: 24.665178287368974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We conduct a convergence analysis of parameter estimation in the contaminated mixture of experts. This model is motivated by the prompt learning problem, where one utilizes prompts, which can be formulated as experts, to fine-tune a large-scale pre-trained model for downstream tasks. Two fundamental challenges emerge from the analysis: (i) the mixture proportion assigned to the prompt may converge to zero during training, leading to the prompt vanishing issue; (ii) algebraic interaction between parameters of the pre-trained model and the prompt can occur via certain partial differential equations and decelerate prompt learning. In response, we introduce a distinguishability condition to control this parameter interaction. We also investigate several types of expert structures to understand their effects on the convergence behavior of parameter estimation. In each scenario, we provide comprehensive convergence rates of parameter estimation along with the corresponding minimax lower bounds. Finally, we run several numerical experiments to empirically justify our theoretical findings.
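To fix ideas, here is a minimal sketch of a contaminated two-component likelihood (the notation is ours, not necessarily the paper's: $g_0$ denotes the frozen pre-trained model, $h(\cdot, \eta)$ a prompt-induced expert, and $\lambda^*$ the prompt's mixture weight):
\[
p_{\lambda^*, \eta^*}(y \mid x) \;=\; (1 - \lambda^*)\, f\big(y \mid g_0(x)\big) \;+\; \lambda^*\, f\big(y \mid h(x, \eta^*)\big), \qquad \lambda^* \in [0, 1].
\]
In this notation, the prompt vanishing issue is the event that the estimate of $\lambda^*$ drifts to zero during training, and the parameter interaction arises when partial derivatives of $f$ with respect to the pre-trained and prompt parameters are linked through common partial differential equations, which slows the estimation of $\eta^*$.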
Related papers
- Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained across combinations of optimizers, parameterizations, learning rates, and model sizes.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z)
- Least Squares Regression Can Exhibit Under-Parameterized Double Descent [6.645111950779666]
We study the relationship between the number of training data points, the number of parameters, and the generalization capabilities of models.
We postulate that the location of the peak depends on properties of both the spectrum and the eigenvectors of the sample covariance.
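For reference, a known isotropic baseline (from Hastie et al. 2019, not this paper's result): with aspect ratio $\gamma = p/n$, signal strength $r^2$, and noise level $\sigma^2$, the asymptotic risk of min-norm least squares is
\[
R(\gamma) \;=\; \sigma^2 \frac{\gamma}{1 - \gamma} \quad (\gamma < 1), \qquad R(\gamma) \;=\; r^2\Big(1 - \frac{1}{\gamma}\Big) + \sigma^2 \frac{1}{\gamma - 1} \quad (\gamma > 1),
\]
which peaks (diverges) exactly at the interpolation threshold $\gamma = 1$; the point of the paper above is that, for anisotropic covariances, the peak can instead appear in the underparameterized regime, at a location governed by the covariance spectrum and eigenvectors.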
arXiv Detail & Related papers (2023-05-24T03:52:48Z)
- Towards Convergence Rates for Parameter Estimation in Gaussian-gated Mixture of Experts [40.24720443257405]
We provide a convergence analysis for maximum likelihood estimation (MLE) in the Gaussian-gated MoE model.
Our findings reveal that the MLE has distinct behaviors under two complementary settings of the location parameters of the Gaussian gating functions.
Notably, these behaviors can be characterized by the solvability of two different systems of equations.
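For orientation, a hedged sketch of a Gaussian-gated MoE likelihood (our notation; the paper's exact parameterization may differ):
\[
p(y \mid x) \;=\; \sum_{i=1}^{k} \frac{\pi_i\, \mathcal{N}(x \mid \kappa_i, \Sigma_i)}{\sum_{j=1}^{k} \pi_j\, \mathcal{N}(x \mid \kappa_j, \Sigma_j)}\; \mathcal{N}\big(y \mid a_i^{\top} x + b_i,\, \sigma_i^2\big),
\]
where the $\kappa_i$ are the location parameters of the Gaussian gating functions referred to in the summary; the attainable estimation rates hinge on whether the associated systems of polynomial equations admit solutions.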
arXiv Detail & Related papers (2023-05-12T16:02:19Z)
- Mitigating multiple descents: A model-agnostic framework for risk monotonization [84.6382406922369]
We develop a general framework for risk monotonization based on cross-validation.
We propose two data-driven methodologies, namely zero- and one-step, that are akin to bagging and boosting.
arXiv Detail & Related papers (2022-05-25T17:41:40Z)
- Parameters or Privacy: A Provable Tradeoff Between Overparameterization and Membership Inference [29.743945643424553]
Overparameterized models generalize well (small error on the test data) even when trained to memorize the training data (zero error on the training data).
This has led to an arms race towards increasingly overparameterized models (cf. deep learning).
arXiv Detail & Related papers (2022-02-02T19:00:21Z)
- Evaluating Sensitivity to the Stick-Breaking Prior in Bayesian Nonparametrics [85.31247588089686]
We show that variational Bayesian methods can yield sensitivities with respect to parametric and nonparametric aspects of Bayesian models.
We provide both theoretical and empirical support for our variational approach to Bayesian sensitivity analysis.
arXiv Detail & Related papers (2021-07-08T03:40:18Z)
- MINIMALIST: Mutual INformatIon Maximization for Amortized Likelihood Inference from Sampled Trajectories [61.3299263929289]
Simulation-based inference enables learning the parameters of a model even when its likelihood cannot be computed in practice.
One class of methods uses data simulated with different parameters to infer an amortized estimator for the likelihood-to-evidence ratio.
We show that this approach can be formulated in terms of mutual information between model parameters and simulated data.
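The connection can be seen through the standard ratio trick (a general sketch, not necessarily the paper's exact construction): train a classifier $d(x, \theta)$ to distinguish joint samples $(x, \theta) \sim p(x, \theta)$ from marginal samples $(x, \theta) \sim p(x)\,p(\theta)$; the optimal classifier recovers
\[
\frac{d^*(x, \theta)}{1 - d^*(x, \theta)} \;=\; \frac{p(x, \theta)}{p(x)\, p(\theta)} \;=\; \frac{p(x \mid \theta)}{p(x)} \;=:\; r(x, \theta), \qquad I(\theta; x) \;=\; \mathbb{E}_{p(x, \theta)}\big[\log r(x, \theta)\big],
\]
so learning the likelihood-to-evidence ratio and maximizing a mutual-information objective are two views of the same problem.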
arXiv Detail & Related papers (2021-06-03T12:59:16Z)
- Causal Inference Under Unmeasured Confounding With Negative Controls: A Minimax Learning Approach [84.29777236590674]
We study the estimation of causal parameters when not all confounders are observed and instead negative controls are available.
Recent work has shown how these can enable identification and efficient estimation via two so-called bridge functions.
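For context, the outcome bridge function from the proximal/negative-control literature (a hedged sketch of the standard definition, not a quote from this paper): with treatment $A$, outcome $Y$, covariates $X$, negative-control exposure $Z$, and negative-control outcome $W$, an outcome bridge $h$ solves the conditional moment restriction
\[
\mathbb{E}\big[\, Y - h(W, A, X) \;\big|\; Z, A, X \,\big] \;=\; 0,
\]
after which mean potential outcomes are identified as $\mathbb{E}[Y(a)] = \mathbb{E}[h(W, a, X)]$; a second, treatment-side bridge function plays the dual role in efficient estimation.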
arXiv Detail & Related papers (2021-03-25T17:59:19Z)
- Asymptotic Behavior of Adversarial Training in Binary Classification [41.7567932118769]
Adversarial training is considered to be the state-of-the-art method for defense against adversarial attacks.
Despite its practical success, several open problems remain in understanding the performance of adversarial training.
We derive precise theoretical predictions for the minimizers of adversarial training in binary classification.
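Concretely, adversarial training here refers to the standard robust empirical risk minimization (stated for reference; $\ell$ is a margin-based loss and $\varepsilon$ the perturbation budget):
\[
\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \max_{\|\delta_i\| \le \varepsilon} \ell\big(y_i\, f_{\theta}(x_i + \delta_i)\big),
\]
and the asymptotic analysis characterizes the resulting minimizer $\hat{\theta}$ as the sample size and dimension grow proportionally.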
arXiv Detail & Related papers (2020-10-26T01:44:20Z)
- An Investigation of Why Overparameterization Exacerbates Spurious Correlations [98.3066727301239]
We identify two key properties of the training data that drive this behavior.
We show how the inductive bias of models towards "memorizing" fewer examples can cause overparameterization to hurt.
arXiv Detail & Related papers (2020-05-09T01:59:13Z)
- High Dimensional Data Enrichment: Interpretable, Fast, and Data-Efficient [38.40316295019222]
We introduce an estimator for the problem of multiple connected linear regressions known as Data Enrichment/Sharing.
We show that the recovery of the common parameter benefits from all of the pooled samples.
Overall, we present a first thorough statistical and computational analysis of inference in the data-sharing model.
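A hedged sketch of the data enrichment/sharing model (our notation): each of $G$ groups shares a common parameter $\beta_0$ plus a group-specific correction $\beta_g$,
\[
y_g \;=\; X_g\, (\beta_0 + \beta_g) + \epsilon_g, \qquad g = 1, \dots, G,
\]
so the common component $\beta_0$ can be estimated from all $n = \sum_g n_g$ pooled samples, which is the benefit referenced above.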
arXiv Detail & Related papers (2018-06-11T15:15:44Z)