PoF: Post-Training of Feature Extractor for Improving Generalization
- URL: http://arxiv.org/abs/2207.01847v1
- Date: Tue, 5 Jul 2022 07:16:59 GMT
- Title: PoF: Post-Training of Feature Extractor for Improving Generalization
- Authors: Ikuro Sato, Ryota Yamada, Masayuki Tanaka, Nakamasa Inoue, Rei Kawakami
- Abstract summary: We develop a training algorithm that updates the feature extractor part of an already-trained deep model to search for a flatter minimum.
Experimental results show that PoF improved model performance over baseline methods.
- Score: 15.27255942938806
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: It has been intensively investigated that the local shape, especially
flatness, of the loss landscape near a minimum plays an important role for
generalization of deep models. We developed a training algorithm called PoF:
Post-Training of Feature Extractor that updates the feature extractor part of
an already-trained deep model to search a flatter minimum. The characteristics
are two-fold: 1) Feature extractor is trained under parameter perturbations in
the higher-layer parameter space, based on observations that suggest flattening
higher-layer parameter space, and 2) the perturbation range is determined in a
data-driven manner aiming to reduce a part of test loss caused by the positive
loss curvature. We provide a theoretical analysis that shows the proposed
algorithm implicitly reduces the target Hessian components as well as the loss.
Experimental results show that PoF improved model performance against baseline
methods on both CIFAR-10 and CIFAR-100 datasets for only 10-epoch
post-training, and on SVHN dataset for 50-epoch post-training. Source code is
available at: https://github.com/DensoITLab/PoF-v1
Related papers
- Machine Unlearning in Low-Dimensional Feature Subspace [47.517520054804976]
Machine Unlearning (MU) aims at removing the influence of specific data from a pretrained model while preserving performance on the remaining data.
In this work, a novel perspective for MU is presented upon low-dimensional feature subspaces, which gives rise to the potential of separating the remaining and forgetting data.
This separability motivates our LOFT, a method that performs unlearning in a LOw-dimensional FeaTure subspace from the pretrained model through principal projections.
arXiv Detail & Related papers (2026-01-30T01:58:38Z)
- DoPE: Denoising Rotary Position Embedding [60.779039511252584]
Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation.
We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE).
DoPE is a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map.
arXiv Detail & Related papers (2025-11-12T09:32:35Z)
- Rolling Ball Optimizer: Learning by ironing out loss landscape wrinkles [19.667068548957143]
Training large neural networks (NNs) requires optimizing high-dimensional data-dependent loss functions.
These functions are often highly complex and textured, even fractal-like.
Noise in the training data can propagate forward and give rise to unrepresentative small-scale geometry.
arXiv Detail & Related papers (2025-05-26T05:26:21Z)
- Parameter-Efficient Fine-Tuning in Spectral Domain for Point Cloud Learning [49.91297276176978]
We propose a novel Parameter-Efficient Fine-Tuning (PEFT) method for point cloud, called Point GST.
Point GST freezes the pre-trained model and introduces a trainable Point Cloud Spectral Adapter (PCSA) to finetune parameters in the spectral domain.
Extensive experiments on challenging point cloud datasets demonstrate that Point GST not only outperforms its fully finetuning counterpart but also significantly reduces trainable parameters.
arXiv Detail & Related papers (2024-10-10T17:00:04Z)
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions via our training procedure, including the gradient and regularizers, limiting flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
- Gradient Boosting Mapping for Dimensionality Reduction and Feature Extraction [2.778647101651566]
A fundamental problem in supervised learning is to find a good set of features or distance measures.
We propose a supervised dimensionality reduction method, where the outputs of weak learners define the embedding.
We show that the embedding coordinates provide better features for the supervised learning task.
arXiv Detail & Related papers (2024-05-14T10:23:57Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z)
- Learning Compact Features via In-Training Representation Alignment [19.273120635948363]
In each epoch, the true gradient of the loss function is estimated using a mini-batch sampled from the training set.
We propose In-Training Representation Alignment (ITRA) that explicitly aligns feature distributions of two different mini-batches with a matching loss.
We also provide a rigorous analysis of the desirable effects of the matching loss on feature representation learning.
arXiv Detail & Related papers (2022-11-23T22:23:22Z)
- Towards Sparsification of Graph Neural Networks [9.568566305616656]
We use two state-of-the-art model compression methods, (1) train and prune and (2) sparse training, for the sparsification of weight layers in GNNs.
We evaluate and compare the efficiency of both methods in terms of accuracy, training sparsity, and training FLOPs on real-world graphs.
arXiv Detail & Related papers (2022-09-11T01:39:29Z)
- Adaptive Self-supervision Algorithms for Physics-informed Neural Networks [59.822151945132525]
Physics-informed neural networks (PINNs) incorporate physical knowledge from the problem domain as a soft constraint on the loss function.
We study the impact of the location of the collocation points on the trainability of these models.
We propose a novel adaptive collocation scheme which progressively allocates more collocation points to areas where the model is making higher errors.
arXiv Detail & Related papers (2022-07-08T18:17:06Z)
- Structured Directional Pruning via Perturbation Orthogonal Projection [13.704348351073147]
A more reasonable approach is to find a sparse minimizer along the flat minimum valley found by the optimizer.
We propose the structured directional pruning based on projecting the perturbations onto the flat minimum valley.
Experiments show that our method obtains the state-of-the-art pruned accuracy (i.e. 93.97% on VGG16, CIFAR-10 task) without retraining.
arXiv Detail & Related papers (2021-07-12T11:35:47Z)
- Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Improve SGD Training via Aligning Mini-batches [22.58823484394866]
In-Training Distribution Matching (ITDM) is proposed to improve deep neural networks (DNNs) training and reduce overfitting.
Specifically, ITDM regularizes the feature extractor by matching the moments of distributions of different mini-batches in each iteration of SGD.
arXiv Detail & Related papers (2020-02-23T15:10:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.