Image as First-Order Norm+Linear Autoregression: Unveiling Mathematical
Invariance
- URL: http://arxiv.org/abs/2305.16319v2
- Date: Wed, 11 Oct 2023 20:33:37 GMT
- Title: Image as First-Order Norm+Linear Autoregression: Unveiling Mathematical
Invariance
- Authors: Yinpeng Chen and Xiyang Dai and Dongdong Chen and Mengchen Liu and Lu
Yuan and Zicheng Liu and Youzuo Lin
- Abstract summary: FINOLA represents each image in the latent space as a first-order autoregressive process.
We demonstrate the ability of FINOLA to auto-regress up to a 256x256 feature map.
We also leverage FINOLA for self-supervised learning by employing a simple masked prediction approach.
- Score: 104.05734286732941
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel mathematical property applicable to diverse
images, referred to as FINOLA (First-Order Norm+Linear Autoregressive). FINOLA
represents each image in the latent space as a first-order autoregressive
process, in which each regression step simply applies a shared linear model on
the normalized value of its immediate neighbor. This intriguing property
reveals a mathematical invariance that transcends individual images. Expanding
from image grids to continuous coordinates, we unveil the presence of two
underlying partial differential equations. We validate the FINOLA property from
two distinct angles: image reconstruction and self-supervised learning.
Firstly, we demonstrate the ability of FINOLA to auto-regress up to a 256x256
feature map (the same resolution as the image) from a single vector placed at
the center, successfully reconstructing the original image using only three
3x3 convolution layers as the decoder. Secondly, we leverage FINOLA for
self-supervised learning by employing a simple masked prediction approach.
Encoding a single unmasked quadrant block, we autoregressively predict the
surrounding masked region. Remarkably, this pre-trained representation proves
highly effective in image classification and object detection tasks, even when
integrated into lightweight networks, all without the need for extensive
fine-tuning. The code will be made publicly available.
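The first-order norm+linear recurrence described in the abstract can be sketched in a few lines. The following is a minimal toy illustration, not the authors' implementation: `finola_step`, the channel dimension `C`, and the weight matrix `W` are all hypothetical names, and the sketch assumes the regression step is "normalize the current latent vector, then add a shared linear map of it", applied repeatedly from a single seed vector.

```python
import numpy as np

def finola_step(z, W, eps=1e-5):
    # One first-order norm+linear step (sketch): normalize the
    # immediate-neighbor latent vector, then apply the shared linear model W.
    mu, sigma = z.mean(), z.std()
    return z + W @ ((z - mu) / (sigma + eps))

def autoregress_row(z0, W, length):
    # Unroll the recurrence along one axis, starting from a single seed vector
    # (the paper places the seed at the image center and expands outward).
    row = [z0]
    for _ in range(length - 1):
        row.append(finola_step(row[-1], W))
    return np.stack(row)

rng = np.random.default_rng(0)
C = 8                                   # toy channel dimension
z0 = rng.standard_normal(C)             # single seed vector
W = 0.1 * rng.standard_normal((C, C))   # shared linear model
row = autoregress_row(z0, W, length=16)
print(row.shape)  # (16, 8)
```

Note that `W` is shared across all positions and, per the paper's claim of mathematical invariance, across images; only the seed vector is image-specific.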
Related papers
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can mitigate the heavy data demands of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework, which consists of a Masked Quantization VAE (MQ-VAE) and a Stackformer model that account for regional redundancy.
arXiv Detail & Related papers (2023-05-23T02:15:53Z)
- Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling [23.164631160130092]
We extend the success of BERT-style pre-training, or masked image modeling, to convolutional networks (convnets).
We treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode.
This is the first use of sparse convolution for 2D masked modeling.
arXiv Detail & Related papers (2023-01-09T18:59:50Z)
- Pixel2ISDF: Implicit Signed Distance Fields based Human Body Model from Multi-view and Multi-pose Images [67.45882013828256]
We focus on reconstructing clothed humans in the canonical space given multiple views and poses of a human as the input.
We learn latent codes on the posed mesh by leveraging multiple input images and then assign the latent codes to the mesh in the canonical space.
Our work on reconstructing the human shape in the canonical pose achieves 3rd-place performance on the WCPA MVP-Human Body Challenge.
arXiv Detail & Related papers (2022-12-06T05:30:49Z)
- NP-DRAW: A Non-Parametric Structured Latent Variable Model for Image Generation [139.8037697822064]
We present a non-parametric structured latent variable model for image generation, called NP-DRAW.
It sequentially draws on a latent canvas in a part-by-part fashion and then decodes the image from the canvas.
arXiv Detail & Related papers (2021-06-25T05:17:55Z)
- Shelf-Supervised Mesh Prediction in the Wild [54.01373263260449]
We propose a learning-based approach to infer the 3D shape and pose of an object from a single image.
We first infer a volumetric representation in a canonical frame, along with the camera pose.
The coarse volumetric prediction is then converted to a mesh-based representation, which is further refined in the predicted camera frame.
arXiv Detail & Related papers (2021-02-11T18:57:10Z)
- Neural Hair Rendering [41.25606756188364]
We propose a generic neural-based hair rendering pipeline that can synthesize photo-realistic images from virtual 3D hair models.
A key component of our method is a shared latent space that encodes appearance-invariant structure information of both domains.
arXiv Detail & Related papers (2020-04-28T04:36:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.