FoV-Net: Field-of-View Extrapolation Using Self-Attention and
Uncertainty
- URL: http://arxiv.org/abs/2204.01267v1
- Date: Mon, 4 Apr 2022 06:24:03 GMT
- Title: FoV-Net: Field-of-View Extrapolation Using Self-Attention and
Uncertainty
- Authors: Liqian Ma, Stamatios Georgoulis, Xu Jia, Luc Van Gool
- Abstract summary: We utilize information from a video sequence with a narrow field-of-view to infer the scene at a wider field-of-view.
We propose a temporally consistent field-of-view extrapolation framework, namely FoV-Net.
Experiments show that FoV-Net not only extrapolates the temporally consistent wide field-of-view scene better than existing alternatives, but also provides an associated per-pixel uncertainty.
- Score: 95.11806655550315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to make educated predictions about their surroundings, and
to associate them with a certain confidence, is important for intelligent systems
such as autonomous vehicles and robots. It allows them to plan early and decide
accordingly. Motivated by this observation, in this paper we utilize
information from a video sequence with a narrow field-of-view to infer the
scene at a wider field-of-view. To this end, we propose a temporally consistent
field-of-view extrapolation framework, namely FoV-Net, that: (1) leverages 3D
information to propagate the observed scene parts from past frames; (2)
aggregates the propagated multi-frame information using an attention-based
feature aggregation module and a gated self-attention module, simultaneously
hallucinating any unobserved scene parts; and (3) assigns an interpretable
uncertainty value at each pixel. Extensive experiments show that FoV-Net not
only extrapolates the temporally consistent wide field-of-view scene better
than existing alternatives, but also provides the associated uncertainty, which
may benefit downstream applications involving critical decision-making. The project
page is at http://charliememory.github.io/RAL21_FoV.
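To make the described pipeline more concrete, below is a minimal PyTorch-style sketch of the kind of attention-based multi-frame aggregation, gated self-attention, and per-pixel uncertainty head the abstract outlines. The module name, tensor shapes, and the log-variance parameterization of the uncertainty are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class GatedSelfAttentionAggregator(nn.Module):
    """Fuse features propagated from several past frames and predict an
    extrapolated wide-FoV image together with a per-pixel uncertainty map."""

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        # Per-frame, per-pixel score used to weight the propagated features
        # (attention-based feature aggregation across time).
        self.frame_score = nn.Conv2d(channels, 1, kernel_size=1)
        # Self-attention over spatial positions of the fused feature map.
        self.self_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Gate deciding how much attended (hallucinated) content to blend in.
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, kernel_size=1),
                                  nn.Sigmoid())
        # Output heads: RGB prediction and a log-variance style uncertainty map.
        self.to_rgb = nn.Conv2d(channels, 3, kernel_size=3, padding=1)
        self.to_logvar = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, propagated: torch.Tensor):
        # propagated: (B, T, C, H, W) features warped from T past frames into
        # the target wide-FoV frame (e.g. via depth/pose-based propagation).
        b, t, c, h, w = propagated.shape

        # 1) Attention-based aggregation over the temporal dimension.
        scores = self.frame_score(propagated.flatten(0, 1)).view(b, t, 1, h, w)
        weights = torch.softmax(scores, dim=1)
        fused = (weights * propagated).sum(dim=1)            # (B, C, H, W)

        # 2) Gated self-attention over spatial locations, letting observed
        #    regions inform the hallucination of unobserved ones.
        tokens = fused.flatten(2).transpose(1, 2)            # (B, H*W, C)
        attended, _ = self.self_attn(tokens, tokens, tokens)
        attended = attended.transpose(1, 2).view(b, c, h, w)
        g = self.gate(torch.cat([fused, attended], dim=1))
        out = g * attended + (1.0 - g) * fused

        # 3) Per-pixel prediction and interpretable uncertainty (log-variance).
        return self.to_rgb(out), self.to_logvar(out)
```

Under the assumed log-variance parameterization, the uncertainty map could be trained jointly with the reconstruction, e.g. with an aleatoric-style loss of the form |ŷ − y|·exp(−s) + s, so that poorly observed or hallucinated regions receive higher predicted variance.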
Related papers
- Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653]
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe.
Recent deep learning-based VAD models have shown promising results by generating high-resolution frames.
We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
arXiv Detail & Related papers (2024-03-28T03:07:16Z)
- ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning [132.20119288212376]
We propose a spatial-temporal feature learning scheme towards a set of more representative features for perception, prediction and planning tasks simultaneously.
To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system.
arXiv Detail & Related papers (2022-07-15T16:57:43Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z)
- Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection [27.433162897608543]
We propose Convolutional Transformer based Dual Discriminator Generative Adversarial Networks (CT-D2GAN) to perform unsupervised video anomaly detection.
It contains three key components, i.e., a convolutional encoder to capture the spatial information of input clips, a temporal self-attention module to encode the temporal dynamics, and a convolutional decoder to integrate spatio-temporal features and predict the future frame.
arXiv Detail & Related papers (2021-07-29T03:07:25Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
- SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection [9.924083358178239]
We propose two variants of self-attention for contextual modeling in 3D object detection.
We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors.
Next, we propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations.
arXiv Detail & Related papers (2021-01-07T18:30:32Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
- F2Net: Learning to Focus on the Foreground for Unsupervised Video Object Segmentation [61.74261802856947]
We propose a novel Focus on Foreground Network (F2Net), which delves into the intra-inter frame details for the foreground objects.
Our proposed network consists of three main parts: Siamese Module, Center Guiding Appearance Diffusion Module, and Dynamic Information Fusion Module.
Experiments on the DAVIS2016, YouTube-Objects, and FBMS datasets show that our proposed F2Net achieves state-of-the-art performance with significant improvement.
arXiv Detail & Related papers (2020-12-04T11:30:50Z)