Image as First-Order Norm+Linear Autoregression: Unveiling Mathematical
Invariance
- URL: http://arxiv.org/abs/2305.16319v2
- Date: Wed, 11 Oct 2023 20:33:37 GMT
- Title: Image as First-Order Norm+Linear Autoregression: Unveiling Mathematical
Invariance
- Authors: Yinpeng Chen and Xiyang Dai and Dongdong Chen and Mengchen Liu and Lu
Yuan and Zicheng Liu and Youzuo Lin
- Abstract summary: FINOLA represents each image in the latent space as a first-order autoregressive process.
We demonstrate the ability of FINOLA to auto-regress up to a 256x256 feature map.
We also leverage FINOLA for self-supervised learning by employing a simple masked prediction approach.
- Score: 104.05734286732941
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel mathematical property applicable to diverse
images, referred to as FINOLA (First-Order Norm+Linear Autoregressive). FINOLA
represents each image in the latent space as a first-order autoregressive
process, in which each regression step simply applies a shared linear model on
the normalized value of its immediate neighbor. This intriguing property
reveals a mathematical invariance that transcends individual images. Expanding
from image grids to continuous coordinates, we unveil the presence of two
underlying partial differential equations. We validate the FINOLA property from
two distinct angles: image reconstruction and self-supervised learning.
Firstly, we demonstrate the ability of FINOLA to auto-regress up to a 256x256
feature map (the same resolution as the image) from a single vector placed at
the center, successfully reconstructing the original image using only three
3x3 convolution layers as the decoder. Secondly, we leverage FINOLA for
self-supervised learning by employing a simple masked prediction approach.
Encoding a single unmasked quadrant block, we autoregressively predict the
surrounding masked region. Remarkably, this pre-trained representation proves
highly effective in image classification and object detection tasks, even when
integrated into lightweight networks, all without the need for extensive
fine-tuning. The code will be made publicly available.
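The first-order norm+linear recurrence described in the abstract can be sketched in a few lines. The following is a minimal toy illustration, not the authors' implementation: `finola_step`, the channel dimension `C`, and the weight matrix `W` are all hypothetical names, and the sketch assumes the regression step is "normalize the current latent vector, then add a shared linear map of it", applied repeatedly from a single seed vector.

```python
import numpy as np

def finola_step(z, W, eps=1e-5):
    # One first-order norm+linear step (sketch): normalize the
    # immediate-neighbor latent vector, then apply the shared linear model W.
    mu, sigma = z.mean(), z.std()
    return z + W @ ((z - mu) / (sigma + eps))

def autoregress_row(z0, W, length):
    # Unroll the recurrence along one axis, starting from a single seed vector
    # (the paper places the seed at the image center and expands outward).
    row = [z0]
    for _ in range(length - 1):
        row.append(finola_step(row[-1], W))
    return np.stack(row)

rng = np.random.default_rng(0)
C = 8                                   # toy channel dimension
z0 = rng.standard_normal(C)             # single seed vector
W = 0.1 * rng.standard_normal((C, C))   # shared linear model
row = autoregress_row(z0, W, length=16)
print(row.shape)  # (16, 8)
```

Note that `W` is shared across all positions and, per the paper's claim of mathematical invariance, across images; only the seed vector is image-specific.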
Related papers
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can mitigate the heavy data demands of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework, which consists of a Masked Quantization VAE (MQ-VAE) and a Stackformer model that account for regional redundancy.
arXiv Detail & Related papers (2023-05-23T02:15:53Z)
- Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling [23.164631160130092]
We extend the success of BERT-style pre-training, or masked image modeling, to convolutional networks (convnets).
We treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode.
This is the first use of sparse convolution for 2D masked modeling.
arXiv Detail & Related papers (2023-01-09T18:59:50Z)
- Pixel2ISDF: Implicit Signed Distance Fields based Human Body Model from Multi-view and Multi-pose Images [67.45882013828256]
We focus on reconstructing clothed humans in the canonical space given multiple views and poses of a human as the input.
We learn latent codes on the posed mesh by leveraging multiple input images and then assign the latent codes to the mesh in the canonical space.
Our work on reconstructing the human shape in the canonical pose achieves 3rd-place performance on the WCPA MVP-Human Body Challenge.
arXiv Detail & Related papers (2022-12-06T05:30:49Z)
- NP-DRAW: A Non-Parametric Structured Latent Variable Model for Image Generation [139.8037697822064]
We present a non-parametric structured latent variable model for image generation, called NP-DRAW.
It sequentially draws on a latent canvas in a part-by-part fashion and then decodes the image from the canvas.
arXiv Detail & Related papers (2021-06-25T05:17:55Z)
- Shelf-Supervised Mesh Prediction in the Wild [54.01373263260449]
We propose a learning-based approach to infer the 3D shape and pose of an object from a single image.
We first infer a volumetric representation in a canonical frame, along with the camera pose.
The coarse volumetric prediction is then converted to a mesh-based representation, which is further refined in the predicted camera frame.
arXiv Detail & Related papers (2021-02-11T18:57:10Z)
- Neural Hair Rendering [41.25606756188364]
We propose a generic neural-based hair rendering pipeline that can synthesize photo-realistic images from virtual 3D hair models.
A key component of our method is a shared latent space that encodes appearance-invariant structure information of both domains.
arXiv Detail & Related papers (2020-04-28T04:36:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.