Learning Invariant World State Representations with Predictive Coding
- URL: http://arxiv.org/abs/2207.02972v1
- Date: Wed, 6 Jul 2022 21:08:30 GMT
- Title: Learning Invariant World State Representations with Predictive Coding
- Authors: Avi Ziskind, Sujeong Kim, and Giedrius T. Burachas
- Abstract summary: We develop a new predictive coding-based architecture and a hybrid fully-supervised/self-supervised learning method.
We evaluate the robustness of our model on a new synthetic dataset.
- Score: 1.8963850600275547
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised learning methods overcome the key bottleneck for building
more capable AI: limited availability of labeled data. However, one of the
drawbacks of self-supervised architectures is that the representations that
they learn are implicit and it is hard to extract meaningful information about
the encoded world states, such as the 3D structure of the visual scene encoded in a
depth map. Moreover, in the visual domain such representations only rarely
undergo evaluations that may be critical for downstream tasks, such as vision
for autonomous cars. Herein, we propose a framework for evaluating visual
representations for illumination invariance in the context of depth perception.
We develop a hybrid fully-supervised/self-supervised learning method and a
novel architecture that extends the predictive coding approach: the PRedictive
Lateral bottom-Up and top-Down Encoder-decoder Network (PreludeNet), which
explicitly learns to infer and predict depth from video frames. In PreludeNet, the
encoder's stack of predictive coding layers is trained in a self-supervised
manner, while the predictive decoder is trained in a supervised manner to infer
or predict the depth. We evaluate the robustness of our model on a new
synthetic dataset, in which lighting conditions (such as overall illumination
and the effect of shadows) can be parametrically adjusted while keeping all
other aspects of the world constant. PreludeNet achieves both competitive depth
inference performance and next frame prediction accuracy. We also show how this
new network architecture, coupled with the hybrid
fully-supervised/self-supervised learning method, achieves a balance between
this performance and invariance to changes in lighting. The proposed framework
for evaluating visual representations can be extended to diverse task domains
and invariance tests.
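To make the hybrid scheme concrete, below is a minimal PyTorch sketch of the two-objective setup the abstract describes: an encoder trained with a self-supervised next-frame prediction error and a predictive decoder trained with a supervised depth loss. The module structures, layer sizes, loss choices, and the weighting `lam` are illustrative assumptions, not the actual PreludeNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Stand-in for PreludeNet's stack of predictive coding layers
    (the real lateral/top-down structure is not reproduced here)."""
    def __init__(self, in_ch=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class PredictiveDecoder(nn.Module):
    """Two heads: a predicted next frame (self-supervised target)
    and a depth map (supervised target)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.frame_head = nn.Conv2d(hidden, 3, 3, padding=1)
        self.depth_head = nn.Conv2d(hidden, 1, 3, padding=1)

    def forward(self, z):
        return self.frame_head(z), self.depth_head(z)

encoder, decoder = Encoder(), PredictiveDecoder()
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)

def hybrid_step(frame_t, frame_t1, depth_t1, lam=1.0):
    """One update combining the self-supervised next-frame error with
    the supervised depth error; lam is an assumed weighting."""
    z = encoder(frame_t)
    pred_frame, pred_depth = decoder(z)
    loss = F.mse_loss(pred_frame, frame_t1) + lam * F.l1_loss(pred_depth, depth_t1)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random tensors in place of video frames and ground-truth depth.
f_t, f_t1 = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
d_t1 = torch.rand(2, 1, 64, 64)
print(hybrid_step(f_t, f_t1, d_t1))
```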
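The evaluation framework can likewise be sketched: generate the same scene under parametrically varied illumination, run depth inference, and report how much the depth error moves with lighting. The renderer stub below is a hypothetical placeholder for the paper's synthetic dataset, the illumination multipliers and absolute-relative metric are assumptions, and the sketch reuses `encoder` and `decoder` from the previous block.

```python
def render_scene(illumination: float):
    """Hypothetical stand-in for the synthetic dataset's parametric
    renderer: identical world each call, lighting scaled by `illumination`."""
    torch.manual_seed(0)                        # same scene and depth every call
    frame = (illumination * torch.rand(1, 3, 64, 64)).clamp(0, 1)
    depth_gt = torch.rand(1, 1, 64, 64) + 0.5   # fixed ground-truth depth
    return frame, depth_gt

def abs_rel(pred, gt):
    """Absolute relative depth error, a common depth evaluation metric."""
    return ((pred - gt).abs() / gt.clamp(min=1e-6)).mean().item()

errors = []
for level in [0.25, 0.5, 1.0, 2.0]:             # assumed illumination multipliers
    frame, depth_gt = render_scene(level)
    with torch.no_grad():
        _, pred_depth = decoder(encoder(frame))
    errors.append(abs_rel(pred_depth, depth_gt))

# A lighting-invariant representation should keep this spread near zero.
print("invariance gap:", max(errors) - min(errors))
```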
Related papers
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Bilevel Fast Scene Adaptation for Low-Light Image Enhancement [50.639332885989255]
Enhancing images captured in low-light scenes is a challenging but widely studied task in computer vision.
The main obstacle lies in modeling the distribution discrepancy across different scenes.
We introduce a bilevel paradigm to model this latent correspondence.
A bilevel learning framework is constructed to endow the encoder with scene-irrelevant generality across diverse scenes.
arXiv Detail & Related papers (2023-06-02T08:16:21Z)
- Multi-Frame Self-Supervised Depth with Transformers [33.00363651105475]
We propose a novel transformer architecture for cost volume generation.
We use depth-discretized epipolar sampling to select matching candidates.
We refine predictions through a series of self- and cross-attention layers.
arXiv Detail & Related papers (2022-04-15T19:04:57Z)
- Towards Disentangling Information Paths with Coded ResNeXt [11.884259630414515]
We take a novel approach to enhancing the transparency of the whole network's function.
We propose a neural network architecture for classification, in which the information that is relevant to each class flows through specific paths.
arXiv Detail & Related papers (2022-02-10T21:45:49Z)
- Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules.
Inputs to the model are routed through a sequence of functions in a way that is end-to-end learned.
We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferable to a new task in a sample-efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z)
- Learning by Distillation: A Self-Supervised Learning Framework for Optical Flow Estimation [71.76008290101214]
DistillFlow is a knowledge distillation approach to learning optical flow.
It achieves state-of-the-art unsupervised learning performance on both KITTI and Sintel datasets.
Our models ranked 1st among all monocular methods on the KITTI 2015 benchmark, and outperform all published methods on the Sintel Final benchmark.
arXiv Detail & Related papers (2021-06-08T09:13:34Z)
- Stereo Matching by Self-supervision of Multiscopic Vision [65.38359887232025]
We propose a new self-supervised framework for stereo matching utilizing multiple images captured at aligned camera positions.
A cross photometric loss, an uncertainty-aware mutual-supervision loss, and a new smoothness loss are introduced to optimize the network.
Our model obtains better disparity maps than previous unsupervised methods on the KITTI dataset.
arXiv Detail & Related papers (2021-04-09T02:58:59Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
- Semantically-Guided Representation Learning for Self-Supervised Monocular Depth [40.49380547487908]
We propose a new architecture leveraging fixed pretrained semantic segmentation networks to guide self-supervised representation learning.
Our method improves upon the state of the art for self-supervised monocular depth prediction over all pixels, fine-grained details, and per semantic categories.
arXiv Detail & Related papers (2020-02-27T18:40:10Z)