Residual Connections Harm Generative Representation Learning
- URL: http://arxiv.org/abs/2404.10947v3
- Date: Sat, 12 Oct 2024 01:49:15 GMT
- Title: Residual Connections Harm Generative Representation Learning
- Authors: Xiao Zhang, Ruoxi Jiang, William Gao, Rebecca Willett, Michael Maire,
- Abstract summary: We show that introducing a weighting factor to reduce the influence of identity shortcuts in residual networks significantly enhances semantic feature learning.
Our modification improves linear probing accuracy for both, notably increasing ImageNet accuracy from 67.8% to 72.7% for MAEs with a VIT-B/16 backbone.
- Score: 22.21222349477351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show that introducing a weighting factor to reduce the influence of identity shortcuts in residual networks significantly enhances semantic feature learning in generative representation learning frameworks, such as masked autoencoders (MAEs) and diffusion models. Our modification improves linear probing accuracy for both, notably increasing ImageNet accuracy from 67.8% to 72.7% for MAEs with a VIT-B/16 backbone, while also boosting generation quality for diffusion models. This significant gap suggests that, while residual connection structure serves an essential role in facilitating gradient propagation, it may have a harmful side effect of reducing capacity for abstract learning by virtue of injecting an echo of shallower representations into deeper layers. We ameliorate this downside via a fixed formula for monotonically decreasing the contribution of identity connections as layer depth increases. Our design promotes the gradual development of feature abstractions, without impacting network trainability. Analyzing the representations learned by our modified residual networks, we find correlation between low effective feature rank and downstream task performance.
Related papers
- Knowledge Regularized Negative Feature Tuning of Vision-Language Models for Out-of-Distribution Detection [54.433899174017185]
Out-of-distribution (OOD) detection is crucial for building reliable machine learning models.<n>We propose a novel method called Knowledge Regularized Negative Feature Tuning (KR-NFT)<n>NFT applies distribution-aware transformations to pre-trained text features, effectively separating positive and negative features into distinct spaces.<n>When trained with few-shot samples from ImageNet dataset, KR-NFT not only improves ID classification accuracy and OOD detection but also significantly reduces the FPR95 by 5.44%.
arXiv Detail & Related papers (2025-07-26T07:44:04Z) - Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning [57.3885832382455]
We show that introducing static network sparsity alone can unlock further scaling potential beyond dense counterparts with state-of-the-art architectures.<n>Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve both higher parameter efficiency for network expressivity.
arXiv Detail & Related papers (2025-06-20T17:54:24Z) - Auto-Compressing Networks [59.83547898874152]
We introduce Auto- Networks (ACNs), an architectural variant where additive long feedforward connections from each layer replace traditional short residual connections.<n>ACNs showcase unique property we coin as "auto-compression", the ability of a network to organically compress information during training.<n>We find that ACNs exhibit enhanced noise robustness compared to residual networks, superior performance in low-data settings, and mitigate catastrophic forgetting.
arXiv Detail & Related papers (2025-06-11T13:26:09Z) - LAFR: Efficient Diffusion-based Blind Face Restoration via Latent Codebook Alignment Adapter [52.93785843453579]
Blind face restoration from low-quality (LQ) images is a challenging task that requires high-fidelity image reconstruction and the preservation of facial identity.<n>We propose LAFR, a novel codebook-based latent space adapter that aligns the latent distribution of LQ images with that of HQ counterparts.<n>We show that lightweight finetuning of diffusion prior on just 0.9% of FFHQ dataset is sufficient to achieve results comparable to state-of-the-art methods.
arXiv Detail & Related papers (2025-05-29T14:11:16Z) - Masked Autoencoders Are Effective Tokenizers for Diffusion Models [56.08109308294133]
MAETok is an autoencoder that learns semantically rich latent space while maintaining reconstruction fidelity.
MaETok achieves significant practical improvements, enabling a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512x512 generation.
arXiv Detail & Related papers (2025-02-05T18:42:04Z) - LoyalDiffusion: A Diffusion Model Guarding Against Data Replication [6.818344768093927]
Diffusion models can replicate training data, particularly when the training data includes confidential information.
We propose a replication-aware U-Net architecture that incorporates information transfer blocks into skip connections that are less essential for image quality.
Experiments demonstrate that LoyalDiffusion outperforms the state-of-the-art replication mitigation method achieving a 48.63% reduction in replication while maintaining comparable image quality.
arXiv Detail & Related papers (2024-12-02T04:41:30Z) - PseudoNeg-MAE: Self-Supervised Point Cloud Learning using Conditional Pseudo-Negative Embeddings [55.55445978692678]
PseudoNeg-MAE is a self-supervised learning framework that enhances global feature representation of point cloud mask autoencoders.
We show that PseudoNeg-MAE achieves state-of-the-art performance on the ModelNet40 and ScanObjectNN datasets.
arXiv Detail & Related papers (2024-09-24T07:57:21Z) - An Enhanced Encoder-Decoder Network Architecture for Reducing Information Loss in Image Semantic Segmentation [6.596361762662328]
We introduce an innovative encoder-decoder network structure enhanced with residual connections.
Our approach employs a multi-residual connection strategy designed to preserve the intricate details across various image scales more effectively.
To enhance the convergence rate of network training and mitigate sample imbalance issues, we have devised a modified cross-entropy loss function.
arXiv Detail & Related papers (2024-05-26T05:15:53Z) - Layer-wise Feedback Propagation [53.00944147633484]
We present Layer-wise Feedback Propagation (LFP), a novel training approach for neural-network-like predictors.
LFP assigns rewards to individual connections based on their respective contributions to solving a given task.
We demonstrate its effectiveness in achieving comparable performance to gradient descent on various models and datasets.
arXiv Detail & Related papers (2023-08-23T10:48:28Z) - Effect of Choosing Loss Function when Using T-batching for
Representation Learning on Dynamic Networks [0.0]
T-batching is a valuable technique for training dynamic network models.
We have identified a limitation in the training loss function used with t-batching.
We propose two alternative loss functions that overcome these issues, resulting in enhanced training performance.
arXiv Detail & Related papers (2023-08-13T23:34:36Z) - Unlocking the Potential of Federated Learning for Deeper Models [24.875271131226707]
Federated learning (FL) is a new paradigm for distributed machine learning that allows a global model to be trained across multiple clients.
We propose several technical guidelines based on reducing divergence, such as using wider models and reducing the receptive field.
These approaches can greatly improve the accuracy of FL on deeper models.
arXiv Detail & Related papers (2023-06-05T08:45:44Z) - Sharpness-Aware Minimization Leads to Low-Rank Features [49.64754316927016]
Sharpness-aware minimization (SAM) is a recently proposed method that minimizes the training loss of a neural network.
We show that SAM reduces the feature rank which happens at different layers of a neural network.
We confirm this effect theoretically and check that it can also occur in deep networks.
arXiv Detail & Related papers (2023-05-25T17:46:53Z) - Deep Augmentation: Self-Supervised Learning with Transformations in Activation Space [19.495587566796278]
We introduce Deep Augmentation, an approach to implicit data augmentation using dropout or PCA to transform a targeted layer within a neural network to improve performance and generalization.
We demonstrate Deep Augmentation through extensive experiments on contrastive learning tasks in NLP, computer vision, and graph learning.
arXiv Detail & Related papers (2023-03-25T19:03:57Z) - A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
arXiv Detail & Related papers (2022-10-27T13:24:08Z) - Image Superresolution using Scale-Recurrent Dense Network [30.75380029218373]
Recent advances in the design of convolutional neural network (CNN) have yielded significant improvements in the performance of image super-resolution (SR)
We propose a scale recurrent SR architecture built upon units containing series of dense connections within a residual block (Residual Dense Blocks (RDBs))
Our scale recurrent design delivers competitive performance for higher scale factors while being parametrically more efficient as compared to current state-of-the-art approaches.
arXiv Detail & Related papers (2022-01-28T09:18:43Z) - Implicit Under-Parameterization Inhibits Data-Efficient Deep
Reinforcement Learning [97.28695683236981]
More gradient updates decrease the expressivity of the current value network.
We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings.
arXiv Detail & Related papers (2020-10-27T17:55:16Z) - Adversarial Training Reduces Information and Improves Transferability [81.59364510580738]
Recent results show that features of adversarially trained networks for classification, in addition to being robust, enable desirable properties such as invertibility.
We show that the Adversarial Training can improve linear transferability to new tasks, from which arises a new trade-off between transferability of representations and accuracy on the source task.
arXiv Detail & Related papers (2020-07-22T08:30:16Z) - Untangling tradeoffs between recurrence and self-attention in neural
networks [81.30894993852813]
We present a formal analysis of how self-attention affects gradient propagation in recurrent networks.
We prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies.
We propose a relevancy screening mechanism that allows for a scalable use of sparse self-attention with recurrence.
arXiv Detail & Related papers (2020-06-16T19:24:25Z) - Non-Linearities Improve OrigiNet based on Active Imaging for Micro
Expression Recognition [8.112868317921853]
We introduce an active imaging concept to segregate active changes in expressive regions of a video into a single frame.
We propose a shallow CNN network: hybrid local receptive field based augmented learning network (OrigiNet) that efficiently learns significant features of the micro-expressions in a video.
arXiv Detail & Related papers (2020-05-16T13:44:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.