Learning to Generate Gradients for Test-Time Adaptation via Test-Time Training Layers
- URL: http://arxiv.org/abs/2412.16901v1
- Date: Sun, 22 Dec 2024 07:24:09 GMT
- Title: Learning to Generate Gradients for Test-Time Adaptation via Test-Time Training Layers
- Authors: Qi Deng, Shuaicheng Niu, Ronghao Zhang, Yaofo Chen, Runhao Zeng, Jian Chen, Xiping Hu,
- Abstract summary: Test-time adaptation aims to fine-tune a trained model online using unlabeled testing data.
In this optimization process, unsupervised learning objectives like entropy frequently encounter noisy learning signals.
We employ a learning-to-optimize approach to automatically learn an entropy generator called Meta Gradient Generator.
- Score: 18.921532965557475
- License:
- Abstract: Test-time adaptation (TTA) aims to fine-tune a trained model online using unlabeled testing data to adapt to new environments or out-of-distribution data, demonstrating broad application potential in real-world scenarios. However, in this optimization process, unsupervised learning objectives like entropy minimization frequently encounter noisy learning signals. These signals produce unreliable gradients, which hinder the model ability to converge to an optimal solution quickly and introduce significant instability into the optimization process. In this paper, we seek to resolve these issues from the perspective of optimizer design. Unlike prior TTA using manually designed optimizers like SGD, we employ a learning-to-optimize approach to automatically learn an optimizer, called Meta Gradient Generator (MGG). Specifically, we aim for MGG to effectively utilize historical gradient information during the online optimization process to optimize the current model. To this end, in MGG, we design a lightweight and efficient sequence modeling layer -- gradient memory layer. It exploits a self-supervised reconstruction loss to compress historical gradient information into network parameters, thereby enabling better memorization ability over a long-term adaptation process. We only need a small number of unlabeled samples to pre-train MGG, and then the trained MGG can be deployed to process unseen samples. Promising results on ImageNet-C, R, Sketch, and A indicate that our method surpasses current state-of-the-art methods with fewer updates, less data, and significantly shorter adaptation iterations. Compared with a previous SOTA method SAR, we achieve 7.4% accuracy improvement and 4.2 times faster adaptation speed on ImageNet-C.
Related papers
- Improving Instance Optimization in Deformable Image Registration with Gradient Projection [7.6061804149819885]
Deformable image registration is inherently a multi-objective optimization problem.
These conflicting objectives often lead to poor optimization outcomes.
Deep learning methods have recently gained popularity in this domain due to their efficiency in processing large datasets.
arXiv Detail & Related papers (2024-10-21T08:27:13Z) - Incorporating Test-Time Optimization into Training with Dual Networks for Human Mesh Recovery [35.138312681232264]
We propose a dual-network architecture that unifies the training-time and test-time objectives.
Our method, armed with meta-learning and the dual networks, outperforms state-of-the-art regression-based and optimization-based HMR approaches.
arXiv Detail & Related papers (2024-01-25T12:04:53Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatiles.
We train an ingest for deep learning which is itself a small neural network that ingests and outputs parameter updates.
We open source our learned, meta-training code, the associated train test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z) - Learning to Optimize Quasi-Newton Methods [22.504971951262004]
This paper introduces a novel machine learning called LODO, which tries to online meta-learn the best preconditioner during optimization.
Unlike other L2O methods, LODO does not require any meta-training on a training task distribution.
We show that our gradient approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians.
arXiv Detail & Related papers (2022-10-11T03:47:14Z) - Faster Optimization-Based Meta-Learning Adaptation Phase [0.0]
We introduce Lambda patterns by which we restrict which weight are updated in the network during the adaptation phase.
The experiments conducted have shown that it is possible to significantly improve the MAML method in the following areas.
arXiv Detail & Related papers (2022-06-13T06:57:17Z) - Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re- parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z) - Joint inference and input optimization in equilibrium networks [68.63726855991052]
deep equilibrium model is a class of models that foregoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Dynamic Scale Training for Object Detection [111.33112051962514]
We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate scale variation challenge in object detection.
Experimental results demonstrate the efficacy of our proposed DST towards scale variation handling.
It does not introduce inference overhead and could serve as a free lunch for general detection configurations.
arXiv Detail & Related papers (2020-04-26T16:48:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.