Entropy-driven Unsupervised Keypoint Representation Learning in Videos
- URL: http://arxiv.org/abs/2209.15404v2
- Date: Tue, 6 Jun 2023 07:23:21 GMT
- Title: Entropy-driven Unsupervised Keypoint Representation Learning in Videos
- Authors: Ali Younes, Simone Schaub-Meyer, Georgia Chalvatzaki
- Abstract summary: We present a novel approach for unsupervised learning of meaningful representations from videos.
We argue that the local entropy of pixel neighborhoods and their temporal evolution create valuable intrinsic supervisory signals for learning prominent features.
Our empirical results show superior performance for our information-driven keypoints, which resolve challenges such as attending to static and dynamic objects and to objects abruptly entering and leaving the scene.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Extracting informative representations from videos is fundamental for
effectively learning various downstream tasks. We present a novel approach for
unsupervised learning of meaningful representations from videos, leveraging the
concept of image spatial entropy (ISE) that quantifies the per-pixel
information in an image. We argue that the local entropy of pixel
neighborhoods and their temporal evolution create valuable intrinsic
supervisory signals for learning prominent features. Building on this idea, we
abstract visual features into a concise representation of keypoints that act as
dynamic information transmitters, and design a deep learning model that learns,
purely unsupervised, spatially and temporally consistent representations
directly from video frames. Two original information-theoretic losses,
computed from local entropy, guide our model to discover consistent keypoint
representations: a loss that maximizes the spatial information covered by the
keypoints, and a loss that optimizes the keypoints' information transportation
over time. We compare our keypoint representation to strong baselines for
various downstream tasks, e.g., learning object dynamics. Our empirical results
show superior performance for our information-driven keypoints, which resolve
challenges such as attending to static and dynamic objects and to objects
abruptly entering and leaving the scene.
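
The abstract's driving signal, image spatial entropy (ISE), is the per-pixel information content of a frame. As a minimal sketch of that idea (not the paper's implementation; the 9-pixel window, 32 bins, and histogram estimator are illustrative assumptions), local entropy can be computed per pixel from an intensity histogram over a sliding window:

```python
import numpy as np

def local_entropy(frame: np.ndarray, window: int = 9, bins: int = 32) -> np.ndarray:
    """Per-pixel Shannon entropy of intensity histograms in a local window.

    frame: 2-D grayscale image with values in [0, 1].
    Returns a map of the same shape; high values mark information-dense
    (textured) neighborhoods, low values mark uniform regions.
    """
    h, w = frame.shape
    pad = window // 2
    # Quantize intensities into `bins` discrete levels.
    levels = np.clip((frame * bins).astype(int), 0, bins - 1)
    levels = np.pad(levels, pad, mode="reflect")
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = levels[i:i + window, j:j + window]
            p = np.bincount(patch.ravel(), minlength=bins) / window**2
            p = p[p > 0]
            out[i, j] = -(p * np.log2(p)).sum()  # Shannon entropy in bits
    return out
```

For experimentation, scikit-image provides an equivalent rank filter (`skimage.filters.rank.entropy`) that is considerably faster than this explicit loop.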
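The two losses are only characterized at a high level in the abstract. Purely as a hedged illustration of the first (the spatial-coverage term), a differentiable toy objective could measure how much of the entropy map the keypoints jointly cover; the Gaussian masks and soft union below are assumptions made for this sketch, not the paper's formulation:

```python
import torch

def coverage_loss(entropy_map: torch.Tensor, keypoints: torch.Tensor,
                  sigma: float = 4.0) -> torch.Tensor:
    """Toy coverage objective: keypoints are rewarded for jointly covering
    high-entropy regions; minimizing the result spreads them over
    informative, non-overlapping areas.

    entropy_map: (H, W) local-entropy map of one frame.
    keypoints:   (K, 2) keypoint coordinates as (y, x), in pixels.
    """
    H, W = entropy_map.shape
    ys = torch.arange(H, dtype=entropy_map.dtype).view(H, 1)
    xs = torch.arange(W, dtype=entropy_map.dtype).view(1, W)
    # Soft Gaussian mask per keypoint, shape (K, H, W).
    d2 = (ys - keypoints[:, 0].view(-1, 1, 1)) ** 2 \
       + (xs - keypoints[:, 1].view(-1, 1, 1)) ** 2
    masks = torch.exp(-d2 / (2 * sigma ** 2))
    # Soft union: overlapping keypoints earn no extra credit.
    union = 1 - torch.prod(1 - masks, dim=0)
    covered = (union * entropy_map).sum() / (entropy_map.sum() + 1e-8)
    return -covered  # minimize => maximize covered information
```

A temporal counterpart, corresponding to the abstract's second loss, would compare how well keypoints transport this covered information across consecutive frames; its exact form is not specified in the abstract.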
Related papers
- Learning in Factored Domains with Information-Constrained Visual Representations (arXiv, 2023-03-30)
  We present a model of human factored representation learning based on an altered form of a $\beta$-Variational Auto-encoder used in a visual learning task.
  Results demonstrate a trade-off in the informational complexity of the model's latent space between the speed of learning and the accuracy of reconstructions.
- Palm up: Playing in the Latent Manifold for Unsupervised Pretraining (arXiv, 2022-10-19)
  We propose an algorithm that exhibits exploratory behavior while utilizing large, diverse datasets.
  Our key idea is to leverage deep generative models that are pretrained on static datasets and introduce a dynamic model in the latent space.
  We then employ an unsupervised reinforcement learning algorithm to explore in this environment and perform unsupervised representation learning on the collected data.
- Self-supervised Sequential Information Bottleneck for Robust Exploration in Deep Reinforcement Learning (arXiv, 2022-09-12)
  In this work, we introduce the sequential information bottleneck objective for learning compressed and temporally coherent representations.
  For efficient exploration in noisy environments, we further construct intrinsic rewards that capture task-relevant state novelty.
- Stochastic Coherence Over Attention Trajectory For Continuous Learning In Video Streams (arXiv, 2022-04-26)
  This paper proposes a novel neural-network-based approach to progressively and autonomously develop pixel-wise representations in a video stream.
  The proposed method is based on a human-like attention mechanism that allows the agent to learn by observing what is moving in the attended locations.
  Our experiments leverage 3D virtual environments and show that the proposed agents can learn to distinguish objects just by observing the video stream.
- Information-Theoretic Odometry Learning (arXiv, 2022-03-11)
  We propose a unified information-theoretic framework for learning-motivated methods aimed at odometry estimation.
  The proposed framework provides an elegant tool for performance evaluation and understanding in information-theoretic language.
- Filtered-CoPhy: Unsupervised Learning of Counterfactual Physics in Pixel Space (arXiv, 2022-02-01)
  We present a method for learning causal relationships in high-dimensional data (images, videos).
  Our method does not require the knowledge or supervision of any ground-truth positions or other object or scene properties.
  We introduce a new challenging and carefully designed counterfactual benchmark for predictions in pixel space.
- PreViTS: Contrastive Pretraining with Video Tracking Supervision (arXiv, 2021-12-01)
  PreViTS is a self-supervised learning (SSL) framework for selecting clips containing the same object.
  PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
  We train a momentum contrastive (MoCo) encoder on the VGG-Sound and Kinetics-400 datasets with PreViTS.
- Video Salient Object Detection via Contrastive Features and Attention Modules (arXiv, 2021-11-03)
  We propose a network with attention modules to learn contrastive features for video salient object detection.
  A co-attention formulation is utilized to combine the low-level and high-level features.
  We show that the proposed method requires less computation and performs favorably against state-of-the-art approaches.
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos (arXiv, 2021-04-15)
  We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
  CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
  It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation (arXiv, 2020-03-31)
  We propose a graph model for video captioning that exploits object interactions in space and time.
  Our model builds interpretable links and is able to provide explicit visual grounding.
  To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.