Attentive VQ-VAE
- URL: http://arxiv.org/abs/2309.11641v2
- Date: Thu, 8 Feb 2024 20:52:25 GMT
- Title: Attentive VQ-VAE
- Authors: Angello Hoyos and Mariano Rivera
- Abstract summary: We present a novel approach to enhance the capabilities of VQ-VAE models through the integration of a Residual Encoder and a Residual Pixel Attention layer, named the Attentive Residual Encoder (AREN).
The AREN is designed to operate effectively at multiple levels, accommodating diverse architectural complexities.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present a novel approach to enhance the capabilities of VQ-VAE models
through the integration of a Residual Encoder and a Residual Pixel Attention
layer, named Attentive Residual Encoder (AREN). The objective of our research
is to improve the performance of VQ-VAE while maintaining practical parameter
levels. The AREN encoder is designed to operate effectively at multiple levels,
accommodating diverse architectural complexities. The key innovation is the
integration of an inter-pixel auto-attention mechanism into the AREN encoder.
This approach allows us to efficiently capture and utilize contextual
information across latent vectors. Additionally, our model uses additional
encoding levels to further enhance its representational power. Our
attention layer employs a minimal parameter approach, ensuring that latent
vectors are modified only when pertinent information from other pixels is
available. Experimental results demonstrate that our proposed modifications
lead to significant improvements in data representation and generation, making
VQ-VAEs even more suitable for a wide range of applications such as those presented here.
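The inter-pixel auto-attention idea described in the abstract can be pictured as a single-head self-attention update over the spatial positions of a latent map, added back to the input as a residual. The following is a minimal illustrative sketch in NumPy, not the authors' implementation; the projection matrices `Wq`, `Wk`, and `Wv` are stand-ins for learned weights.

```python
import numpy as np

def pixel_self_attention(z, Wq, Wk, Wv):
    """Single-head self-attention across the spatial positions of a
    latent map z of shape (H*W, d): each latent vector is updated with
    contextual information gathered from all other pixels, then added
    back to the original vector as a residual."""
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (HW, HW) pixel affinities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over pixels
    return z + attn @ v                           # residual update

# toy latent map: a 4x4 spatial grid of 8-dimensional latent vectors
rng = np.random.default_rng(0)
z = rng.normal(size=(16, 8))
W = [rng.normal(size=(8, 8)) * 0.1 for _ in range(3)]
out = pixel_self_attention(z, *W)
print(out.shape)  # (16, 8)
```

Because the update is residual, latent vectors are only perturbed where the attention map gathers relevant context, which matches the minimal-parameter design goal stated above.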
Related papers
- Quantum Down Sampling Filter for Variational Auto-encoder [0.504868948270058]
Variational Autoencoders (VAEs) are essential tools in generative modeling and image reconstruction.
This study aims to improve the quality of reconstructed images by enhancing their resolution and preserving finer details.
We propose a hybrid model that combines quantum computing techniques in the VAE encoder with convolutional neural networks (CNNs) in the decoder.
arXiv Detail & Related papers (2025-01-09T11:08:55Z) - HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models [96.76995840807615]
HiRes-LLaVA is a novel framework designed to process high-resolution input of any size without altering the original contextual and geometric information.
HiRes-LLaVA comprises two innovative components: (i) a SliceRestore adapter that reconstructs sliced patches into their original form, efficiently extracting both global and local features via down-up-sampling and convolution layers, and (ii) a Self-Mining Sampler to compress the vision tokens based on themselves.
arXiv Detail & Related papers (2024-07-11T17:42:17Z) - LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models [27.795088366122297]
We introduce LiteVAE, a new autoencoder design for latent diffusion models (LDMs).
LiteVAE uses the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality.
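A 2D discrete wavelet transform of the kind LiteVAE builds on splits an image into half-resolution sub-bands before any learned layers see it. Below is a one-level Haar transform as an illustrative stand-in for that front end; it is a sketch of the general technique, not the paper's code.

```python
import numpy as np

def haar_dwt2(x):
    """One level of a 2D Haar discrete wavelet transform on a 2D array
    with even dimensions. Returns (LL, LH, HL, HH) sub-bands, each half
    the input size; the orthonormal scaling preserves signal energy."""
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2   # low-pass approximation
    lh = (a - b + c - d) / 2   # horizontal detail
    hl = (a + b - c - d) / 2   # vertical detail
    hh = (a - b - c + d) / 2   # diagonal detail
    return ll, lh, hl, hh

x = np.arange(16.0).reshape(4, 4)
ll, lh, hl, hh = haar_dwt2(x)
```

Feeding the encoder these sub-bands instead of raw pixels shrinks the spatial resolution by 2x per level while keeping all information, which is where the claimed efficiency gain comes from.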
arXiv Detail & Related papers (2024-05-23T12:06:00Z) - HAT: Hybrid Attention Transformer for Image Restoration [61.74223315807691]
Transformer-based methods have shown impressive performance in image restoration tasks, such as image super-resolution and denoising.
We propose a new Hybrid Attention Transformer (HAT) to activate more input pixels for better restoration.
Our HAT achieves state-of-the-art performance both quantitatively and qualitatively.
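The "hybrid" in HAT refers to combining window-based self-attention with channel attention. The channel-attention half can be sketched as a squeeze-and-excitation-style gate; the code below is an illustrative sketch with random stand-ins for learned weights, not the paper's implementation.

```python
import numpy as np

def channel_attention(x, reduction=2):
    """Squeeze-and-excitation-style channel attention over a feature
    map x of shape (C, H, W): pool each channel to a scalar, pass the
    pooled vector through a small bottleneck, and rescale channels by
    the resulting sigmoid gate."""
    c = x.shape[0]
    s = x.mean(axis=(1, 2))                      # squeeze: global average pool
    rng = np.random.default_rng(0)               # random stand-ins for
    w1 = rng.normal(size=(c, c // reduction)) * 0.1   # learned weights
    w2 = rng.normal(size=(c // reduction, c)) * 0.1
    h = np.maximum(s @ w1, 0)                    # ReLU bottleneck
    gate = 1 / (1 + np.exp(-(h @ w2)))           # sigmoid gate per channel
    return x * gate[:, None, None]               # rescale channels

x = np.ones((4, 3, 3))
y = channel_attention(x)
```

Because the gate is computed from a global pool, every output pixel is influenced by the whole input, which is one sense in which such hybrids "activate more input pixels" than windowed attention alone.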
arXiv Detail & Related papers (2023-09-11T05:17:55Z) - Joint Hierarchical Priors and Adaptive Spatial Resolution for Efficient
Neural Image Compression [11.25130799452367]
We propose an absolute image compression transformer (ICT) for neural image compression (NIC).
The ICT captures both global and local contexts from the latent representations and better parameterizes the distribution of the quantized latents.
Our framework significantly improves the trade-off between coding efficiency and decoder complexity over the versatile video coding (VVC) reference encoder (VTM-18.0) and the neural SwinT-ChARM.
arXiv Detail & Related papers (2023-07-05T13:17:14Z) - Vector Quantized Wasserstein Auto-Encoder [57.29764749855623]
We study learning deep discrete representations from the generative viewpoint.
We endow discrete distributions over sequences of codewords and learn a deterministic decoder that transports the distribution over the sequences of codewords to the data distribution.
We develop further theories to connect it with the clustering viewpoint of WS distance, allowing us to have a better and more controllable clustering solution.
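The discretization step shared by VQ-WAE and the VQ-VAE models above is nearest-codeword assignment: each continuous latent vector is replaced by its closest codebook entry. A minimal sketch of that step, using plain Euclidean distance:

```python
import numpy as np

def vector_quantize(z, codebook):
    """Replace each latent vector in z (N, d) with its nearest codeword
    from codebook (K, d) under squared Euclidean distance. Returns the
    quantized vectors and the chosen codeword indices."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z = np.array([[0.1, 0.0], [0.9, 1.2]])
q, idx = vector_quantize(z, codebook)  # idx is [0, 1]
```

The Wasserstein view in the paper then concerns how the distribution over these index sequences is matched to the data distribution; the assignment itself stays this simple nearest-neighbor rule.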
arXiv Detail & Related papers (2023-02-12T13:51:36Z) - Hierarchical Residual Learning Based Vector Quantized Variational
Autoencoder for Image Reconstruction and Generation [19.92324010429006]
We propose a multi-layer variational autoencoder method, which we call HR-VQVAE, that learns hierarchical discrete representations of the data.
We evaluate our method on the tasks of image reconstruction and generation.
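Hierarchical residual quantization of the kind HR-VQVAE describes can be sketched as a stack of codebooks where each level quantizes the residual left by the previous levels, and the reconstruction is the sum of the selected codewords. This is an illustrative sketch of the general idea, not the paper's implementation.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Quantize z (N, d) with a list of codebooks. Each level encodes
    the residual left after the previous levels, so the running sum of
    chosen codewords approximates z ever more closely."""
    recon = np.zeros_like(z)
    indices = []
    for cb in codebooks:
        residual = z - recon                                   # what remains to code
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)                                 # nearest codeword
        recon = recon + cb[idx]
        indices.append(idx)
    return recon, indices

z = np.array([[0.9, 0.1]])
cb1 = np.array([[1.0, 0.0], [0.0, 1.0]])     # coarse level
cb2 = np.array([[-0.1, 0.1], [0.1, -0.1]])   # fine level codes the residual
recon, indices = residual_quantize(z, [cb1, cb2])
```

In this toy case the coarse level picks [1, 0] and the fine level exactly cancels the remaining [-0.1, 0.1] residual, illustrating how deeper levels refine the reconstruction.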
arXiv Detail & Related papers (2022-08-09T06:04:25Z) - Activating More Pixels in Image Super-Resolution Transformer [53.87533738125943]
Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution.
We propose a novel Hybrid Attention Transformer (HAT) to activate more input pixels for better reconstruction.
Our overall method significantly outperforms the state-of-the-art methods by more than 1dB.
arXiv Detail & Related papers (2022-05-09T17:36:58Z) - Neural Data-Dependent Transform for Learned Image Compression [72.86505042102155]
We build a neural data-dependent transform and introduce a continuous online mode decision mechanism to jointly optimize the coding efficiency for each individual image.
The experimental results show the effectiveness of the proposed neural-syntax design and the continuous online mode decision mechanism.
arXiv Detail & Related papers (2022-03-09T14:56:48Z) - An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond
Feature and Signal [99.49099501559652]
Video Coding for Machine (VCM) aims to bridge the gap between visual feature compression and classical video coding.
We employ a conditional deep generation network to reconstruct video frames with the guidance of learned motion patterns.
By learning to extract sparse motion patterns via a predictive model, the network elegantly leverages the feature representation to generate the appearance of to-be-coded frames.
arXiv Detail & Related papers (2020-01-09T14:18:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.