Optimizing Cox Models with Stochastic Gradient Descent: Theoretical Foundations and Practical Guidances
- URL: http://arxiv.org/abs/2408.02839v1
- Date: Mon, 5 Aug 2024 21:25:10 GMT
- Title: Optimizing Cox Models with Stochastic Gradient Descent: Theoretical Foundations and Practical Guidances
- Authors: Lang Zeng, Weijing Tang, Zhao Ren, Ying Ding
- Abstract summary: Stochastic gradient descent (SGD) has recently been adapted to optimize Cox models.
We demonstrate that the SGD estimator targets an objective function that is batch-size-dependent.
We provide guidance for selecting batch sizes in SGD applications.
- Score: 9.745755948802499
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimizing Cox regression and its neural network variants poses substantial computational challenges in large-scale studies. Stochastic gradient descent (SGD), known for its scalability in model optimization, has recently been adapted to optimize Cox models. Unlike its conventional application, which typically targets a sum of independent individual losses, SGD for Cox models updates parameters based on the partial likelihood of a subset of data. Despite its empirical success, the theoretical foundation for optimizing Cox partial likelihood with SGD is largely underexplored. In this work, we demonstrate that the SGD estimator targets an objective function that is batch-size-dependent. We establish that the SGD estimator for the Cox neural network (Cox-NN) is consistent and achieves the optimal minimax convergence rate up to a polylogarithmic factor. For Cox regression, we further prove the $\sqrt{n}$-consistency and asymptotic normality of the SGD estimator, with variance depending on the batch size. Furthermore, we quantify the impact of batch size on Cox-NN training and its effect on the SGD estimator's asymptotic efficiency in Cox regression. These findings are validated by extensive numerical experiments and provide guidance for selecting batch sizes in SGD applications. Finally, we demonstrate the effectiveness of SGD in a real-world application where GD is infeasible due to the large scale of data.
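The batch-size dependence is easy to see in code: when SGD draws a mini-batch, the partial likelihood forms its risk sets within that batch, so the expected update is not the gradient of the full-data partial likelihood. Below is a minimal NumPy sketch of such a batch-based Cox update; it illustrates the idea only and is not the authors' implementation (the per-event normalization and the learning-rate choice are assumptions).

```python
import numpy as np

def batch_cox_grad(beta, x, t, d):
    """Ascent direction from the Cox partial log-likelihood of one
    mini-batch, with risk sets formed *within* the batch, so the
    implied objective depends on the batch size (the paper's point).
    x: covariates, t: observed times, d: event indicators (1 = event)."""
    eta = x @ beta
    w = np.exp(eta - eta.max())                         # stabilized exp(eta)
    at_risk = (t[None, :] >= t[:, None]).astype(float)  # in-batch risk sets R_i
    denom = at_risk @ w                                 # sum_{j in R_i} exp(eta_j)
    xbar = (at_risk * w) @ x / denom[:, None]           # risk-set weighted mean of x
    grad = (d[:, None] * (x - xbar)).sum(axis=0)
    return grad / max(d.sum(), 1.0)                     # per-event normalization (a choice)

def sgd_cox(x, t, d, batch_size=64, lr=0.1, epochs=50, seed=0):
    """Plain mini-batch SGD (ascent) on the batch partial likelihood."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            beta += lr * batch_cox_grad(beta, x[idx], t[idx], d[idx])
    return beta
```

With `batch_size = n` this reduces to full-batch gradient ascent on the usual partial likelihood; for smaller batches the stationary point shifts, which is precisely the batch-size-dependent objective the paper analyzes.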
Related papers
- Comparison of the Cox proportional hazards model and Random Survival Forest algorithm for predicting patient-specific survival probabilities in clinical trial data [0.0]
The Cox proportional hazards model is often used for model development in randomized controlled trials (RCT) with time-to-event outcomes.
The random survival forest (RSF) is a machine-learning algorithm known for its high predictive performance.
We conduct a comprehensive neutral comparison study to compare the predictive performance of Cox regression and RSF in real-world as well as simulated data.
arXiv Detail & Related papers (2025-02-05T12:26:43Z)
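For readers who want to reproduce this kind of comparison, the sketch below fits both models on synthetic right-censored data using the Python package scikit-survival; this package choice is an assumption on our part, since the entry above does not specify the software used.

```python
import numpy as np
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

# Synthetic right-censored data from an exponential proportional hazards model.
rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.normal(size=(n, p))
hazard = np.exp(X @ np.array([0.5, -0.5, 0.3, 0.0, 0.0]))
event_time = rng.exponential(1.0 / hazard)
cens_time = rng.exponential(2.0, size=n)
y = Surv.from_arrays(event=event_time <= cens_time,
                     time=np.minimum(event_time, cens_time))

for model in (CoxPHSurvivalAnalysis(),
              RandomSurvivalForest(n_estimators=200, random_state=0)):
    model.fit(X, y)
    # .score returns Harrell's concordance index; a real comparison would
    # use held-out or cross-validated data, not in-sample scores.
    print(type(model).__name__, "C-index:", round(model.score(X, y), 3))
```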
- On the Convergence of DP-SGD with Adaptive Clipping [56.24689348875711]
Stochastic gradient descent with gradient clipping is a powerful technique for enabling differentially private optimization.
This paper provides the first comprehensive convergence analysis of SGD with quantile clipping (QC-SGD).
We show that QC-SGD suffers from a bias problem similar to that of constant-threshold clipped SGD, which can be mitigated through a carefully designed quantile and step size schedule.
arXiv Detail & Related papers (2024-12-27T20:29:47Z)
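A minimal sketch of the quantile-clipping idea follows; the actual QC-SGD additionally schedules the quantile and step size, and the noise addition needed for differential privacy is omitted here.

```python
import numpy as np

def quantile_clipped_step(per_sample_grads, q=0.5):
    """Clip each per-sample gradient to the batch's q-th quantile of
    gradient norms, then average. Sketch of the quantile-clipping idea
    only; DP noise addition and the quantile/step-size schedule from
    the paper are omitted."""
    norms = np.linalg.norm(per_sample_grads, axis=1)
    c = np.quantile(norms, q)                         # adaptive clipping threshold
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return (per_sample_grads * scale[:, None]).mean(axis=0)
```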
- Deep Partially Linear Transformation Model for Right-Censored Survival Data [9.991327369572819]
This paper introduces a deep partially linear transformation model (DPLTM) as a general and flexible framework for estimation, inference and prediction.
Comprehensive simulation studies demonstrate the impressive performance of the proposed estimation procedure in terms of both estimation accuracy and prediction power.
arXiv Detail & Related papers (2024-12-10T15:50:43Z)
- Efficient adjustment for complex covariates: Gaining efficiency with DOPE [56.537164957672715]
We propose a framework that accommodates adjustment for any subset of information expressed by the covariates.
Based on our theoretical results, we propose the Debiased Outcome-adapted Propensity Estimator (DOPE) for efficient estimation of the average treatment effect (ATE).
Our results show that the DOPE provides an efficient and robust methodology for ATE estimation in various observational settings.
arXiv Detail & Related papers (2024-02-20T13:02:51Z)
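DOPE refines the debiased, doubly-robust estimation template; for orientation, here is the standard augmented inverse-propensity-weighted (AIPW) estimator of the ATE that such methods build on (a generic baseline sketch, not DOPE itself).

```python
import numpy as np

def aipw_ate(y, a, m1, m0, e):
    """Standard doubly-robust (AIPW) ATE estimate.
    y: outcomes, a: binary treatment indicators,
    m1/m0: outcome-model predictions under treatment/control,
    e: estimated propensity scores (assumed bounded away from 0 and 1)."""
    return np.mean(m1 - m0
                   + a * (y - m1) / e
                   - (1 - a) * (y - m0) / (1 - e))
```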
- Bayesian Optimization through Gaussian Cox Process Models for Spatio-temporal Data [27.922624489449017]
We propose a novel maximum a posteriori inference of Gaussian Cox processes.
We further develop a Nyström approximation for efficient computation.
arXiv Detail & Related papers (2024-01-25T22:26:15Z)
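The Nyström step replaces an n-by-n kernel matrix with a low-rank surrogate built from m inducing points, which is what makes the posterior computations tractable; a generic sketch (not the paper's code):

```python
import numpy as np

def nystrom(K_nm, K_mm, jitter=1e-8):
    """Nystrom approximation K ~ K_nm @ inv(K_mm) @ K_nm.T built from
    m inducing points; jitter stabilizes the Cholesky factorization.
    K_nm: (n, m) cross-kernel, K_mm: (m, m) inducing-point kernel."""
    m = K_mm.shape[0]
    L = np.linalg.cholesky(K_mm + jitter * np.eye(m))
    B = np.linalg.solve(L, K_nm.T)   # B = L^{-1} K_mn
    return B.T @ B                   # = K_nm K_mm^{-1} K_mn (rank <= m)
```

In practice one keeps the factor `B` rather than forming the n-by-n product, which drops costs to O(nm) memory and O(nm^2) time instead of O(n^2) and O(n^3).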
- A Specialized Semismooth Newton Method for Kernel-Based Optimal Transport [92.96250725599958]
Kernel-based optimal transport (OT) estimators offer an alternative, functional estimation procedure to address OT problems from samples.
We show that our SSN method achieves a global convergence rate of $O(1/\sqrt{k})$ and a local quadratic convergence rate under standard regularity conditions.
arXiv Detail & Related papers (2023-10-21T18:48:45Z)
- Differentially private training of neural networks with Langevin dynamics for calibrated predictive uncertainty [58.730520380312676]
We show that differentially private stochastic gradient descent (DP-SGD) can yield poorly calibrated, overconfident deep learning models.
This represents a serious issue for safety-critical applications, e.g. in medical diagnosis.
arXiv Detail & Related papers (2021-07-09T08:14:45Z)
- Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit [47.324627920761685]
We use recent theoretical advances that characterize the function-space prior of an ensemble of infinitely wide NNs as a Gaussian process.
This gives us a better understanding of the implicit prior NNs place on function space.
We also examine the calibration of previous approaches to classification with the NNGP.
arXiv Detail & Related papers (2020-10-14T18:41:54Z)
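The NNGP correspondence gives this function-space prior in closed form; for ReLU activations the kernel follows the arc-cosine recursion below (textbook form with assumed weight/bias variances `sw2`/`sb2`, not the paper's code).

```python
import numpy as np

def nngp_relu(x1, x2, depth=3, sw2=1.0, sb2=0.1):
    """NNGP kernel k(x1, x2) of an infinitely wide depth-layer ReLU
    network, via the arc-cosine recursion (Cho & Saul / Lee et al.)."""
    k12 = sb2 + sw2 * (x1 @ x2) / len(x1)   # layer-0 (input) covariances
    k11 = sb2 + sw2 * (x1 @ x1) / len(x1)
    k22 = sb2 + sw2 * (x2 @ x2) / len(x1)
    for _ in range(depth):
        cos = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
        theta = np.arccos(cos)
        # E[relu(u) relu(v)] under a bivariate normal with these covariances
        k12 = sb2 + sw2 / (2 * np.pi) * np.sqrt(k11 * k22) * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta))
        k11 = sb2 + sw2 * k11 / 2.0         # E[relu(u)^2] = k11 / 2
        k22 = sb2 + sw2 * k22 / 2.0
    return k12
```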
- Adaptive Learning of the Optimal Batch Size of SGD [52.50880550357175]
We propose a method capable of learning the optimal batch size adaptively throughout its iterations for strongly convex and smooth functions.
Our method does this provably, and in our experiments with synthetic and real data it robustly exhibits nearly optimal behaviour.
We generalize our method to several new batch strategies not considered in the literature before, including a sampling scheme suitable for distributed implementations.
arXiv Detail & Related papers (2020-05-03T14:28:32Z)
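The trade-off such a method navigates can be illustrated with the classical "norm test" for batch-size adequacy, a simpler heuristic than the paper's provable scheme (the threshold `theta` is an assumed tuning constant):

```python
import numpy as np

def batch_size_ok(per_sample_grads, theta=1.0):
    """Norm test: the batch is large enough when the estimated variance
    of the mini-batch gradient is small relative to the squared norm of
    the mean gradient. A common driver doubles the batch size whenever
    this test fails."""
    g_bar = per_sample_grads.mean(axis=0)
    var = per_sample_grads.var(axis=0, ddof=1).sum()  # total per-sample variance
    b = len(per_sample_grads)
    return var / b <= theta**2 * np.dot(g_bar, g_bar)
```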
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.