Self-supervised Learning by View Synthesis
- URL: http://arxiv.org/abs/2304.11330v1
- Date: Sat, 22 Apr 2023 06:12:13 GMT
- Title: Self-supervised Learning by View Synthesis
- Authors: Shaoteng Liu, Xiangyu Zhang, Tao Hu, Jiaya Jia
- Abstract summary: We present view-synthesis autoencoders (VSA), a self-supervised learning framework designed for vision transformers.
In each iteration, the input to VSA is one view (or multiple views) of a 3D object and the output is a synthesized image in another target pose.
- Score: 62.27092994474443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present view-synthesis autoencoders (VSA), a
self-supervised learning framework designed for vision transformers. Unlike
traditional 2D pretraining methods, VSA can be pre-trained with multi-view
data. In each iteration, the input to VSA is one view (or multiple views) of a
3D object and the output is a synthesized image in another target pose. The
decoder of VSA has several cross-attention blocks, which use the source view as
the value, the source pose as the key, and the target pose as the query, and
apply cross-attention to synthesize the target view. This simple approach
realizes large-angle view synthesis and learns spatially invariant
representations, the latter providing a strong initialization for transformers
on downstream tasks, such as 3D classification on ModelNet40, ShapeNet Core55,
and ScanObjectNN. VSA outperforms existing methods significantly for linear
probing and is competitive for fine-tuning. The code will be made publicly
available.
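A minimal sketch (in PyTorch, not the authors' released code) of the pose-conditioned cross-attention described in the abstract: source-view tokens serve as values, source-pose embeddings as keys, and target-pose embeddings as queries. Dimensions, token counts, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PoseCrossAttentionBlock(nn.Module):
    """One cross-attention decoder block: query = target pose, key = source pose,
    value = source-view tokens (an illustrative reading of the abstract)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, target_pose, source_pose, source_tokens):
        # Cross-attention: the target pose queries the source view,
        # with the source pose acting as the key for each source token.
        attended, _ = self.attn(query=target_pose, key=source_pose, value=source_tokens)
        x = self.norm1(target_pose + attended)
        return self.norm2(x + self.mlp(x))


# Toy usage: batch of 2, 196 patch tokens, embedding dim 256 (all assumed sizes).
src_tokens = torch.randn(2, 196, 256)  # encoded source view (value)
src_pose = torch.randn(2, 196, 256)    # source pose embedding (key)
tgt_pose = torch.randn(2, 196, 256)    # target pose embedding (query)
out = PoseCrossAttentionBlock()(tgt_pose, src_pose, src_tokens)  # (2, 196, 256)
```

The paper's decoder stacks several such blocks; this sketch shows only the query/key/value assignment that distinguishes VSA from a standard self-attention decoder.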
Related papers
- CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion [0.0]
We propose Cross-View Synthesis Transformer (CVSformer), which consists of Multi-View Feature Synthesis and Cross-View Transformer for learning cross-view object relationships.
We use the enhanced features to predict the geometric occupancies and semantic labels of all voxels.
We evaluate CVSformer on public datasets, where CVSformer yields state-of-the-art results.
arXiv Detail & Related papers (2023-07-16T04:08:03Z) - Partial-View Object View Synthesis via Filtered Inversion [77.282967562509]
FINV learns shape priors by training a 3D generative model.
We show that FINV successfully synthesizes novel views of real-world objects.
arXiv Detail & Related papers (2023-04-03T00:59:31Z) - Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering.
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z) - Novel View Synthesis from a Single Image via Unsupervised learning [27.639536023956122]
We propose an unsupervised network to learn such a pixel transformation from a single source viewpoint.
The learned transformation allows us to synthesize a novel view from any single source viewpoint image of unknown pose.
arXiv Detail & Related papers (2021-10-29T06:32:49Z) - Geometry-Free View Synthesis: Transformers and no 3D Priors [16.86600007830682]
We show that a transformer-based model can synthesize entirely novel views without any hand-engineered 3D biases.
This is achieved by a global attention mechanism that implicitly learns long-range 3D correspondences between source and target views.
arXiv Detail & Related papers (2021-04-15T17:58:05Z) - Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations [61.870882736758624]
We propose a novel self-supervised paradigm to learn Multi-View Transformation Equivariant Representations (MV-TER).
Specifically, we perform a 3D transformation on a 3D object, and obtain multiple views before and after the transformation via projection.
Then, we self-train a representation to capture the intrinsic 3D object representation by decoding 3D transformation parameters from the fused feature representations of multiple views before and after the transformation.
arXiv Detail & Related papers (2021-03-01T06:24:17Z) - Stable View Synthesis [100.86844680362196]
We present Stable View Synthesis (SVS).
Given a set of source images depicting a scene from freely distributed viewpoints, SVS synthesizes new views of the scene.
SVS outperforms state-of-the-art view synthesis methods both quantitatively and qualitatively on three diverse real-world datasets.
arXiv Detail & Related papers (2020-11-14T07:24:43Z) - Continuous Object Representation Networks: Novel View Synthesis without Target View Supervision [26.885846254261626]
Continuous Object Representation Networks (CORN) is a conditional architecture that encodes an input image's geometry and appearance into a 3D-consistent scene representation.
CORN performs well on challenging tasks such as novel view synthesis and single-view 3D reconstruction, achieving performance comparable to state-of-the-art approaches that use direct supervision.
arXiv Detail & Related papers (2020-07-30T17:49:44Z) - Single-View View Synthesis with Multiplane Images [64.46556656209769]
We apply deep learning to generate multiplane images given two or more input images at known viewpoints.
Our method learns to predict a multiplane image directly from a single image input.
It additionally generates reasonable depth maps and fills in content behind the edges of foreground objects in background layers (a minimal compositing sketch follows this list).
arXiv Detail & Related papers (2020-04-23T17:59:19Z)
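As a complement to the last entry above, here is a minimal sketch (not that paper's implementation) of how a multiplane image is composited into a rendered view by back-to-front alpha ("over") blending. Plane count, resolution, and tensor layout are assumptions for illustration.

```python
import torch


def composite_mpi(rgba_planes: torch.Tensor) -> torch.Tensor:
    """Composite an MPI into a single image.

    rgba_planes: (D, 4, H, W), ordered back (index 0) to front (index D-1).
    Returns an RGB image of shape (3, H, W).
    """
    rgb, alpha = rgba_planes[:, :3], rgba_planes[:, 3:4]
    out = torch.zeros_like(rgb[0])  # accumulated color, (3, H, W)
    for d in range(rgba_planes.shape[0]):  # back-to-front "over" blending
        out = rgb[d] * alpha[d] + out * (1.0 - alpha[d])
    return out


# Toy usage: 32 fronto-parallel planes at 64x64 resolution (assumed sizes).
planes = torch.rand(32, 4, 64, 64)
image = composite_mpi(planes)  # (3, 64, 64)
```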
This list is automatically generated from the titles and abstracts of the papers on this site.