Beyond the Field-of-View: Enhancing Scene Visibility and Perception with Clip-Recurrent Transformer
- URL: http://arxiv.org/abs/2211.11293v3
- Date: Sat, 22 Jun 2024 10:12:43 GMT
- Title: Beyond the Field-of-View: Enhancing Scene Visibility and Perception with Clip-Recurrent Transformer
- Authors: Hao Shi, Qi Jiang, Kailun Yang, Xiaoting Yin, Ze Wang, Kaiwei Wang
- Abstract summary: FlowLens architecture explicitly employs optical flow and implicitly incorporates a novel clip-recurrent transformer for feature propagation.
In this paper, we propose the concept of online video inpainting for autonomous vehicles to expand the field of view.
Experiments and user studies involving offline and online video inpainting, as well as beyond-FoV perception tasks, demonstrate that FlowLens achieves state-of-the-art performance.
- Score: 28.326852785609788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision sensors are widely applied in vehicles, robots, and roadside infrastructure. However, due to limitations in hardware cost and system size, camera Field-of-View (FoV) is often restricted and may not provide sufficient coverage. Nevertheless, from a spatiotemporal perspective, it is possible to obtain information beyond the camera's physical FoV from past video streams. In this paper, we propose the concept of online video inpainting for autonomous vehicles to expand the field of view, thereby enhancing scene visibility, perception, and system safety. To achieve this, we introduce the FlowLens architecture, which explicitly employs optical flow and implicitly incorporates a novel clip-recurrent transformer for feature propagation. FlowLens offers two key features: 1) FlowLens includes a newly designed Clip-Recurrent Hub with 3D-Decoupled Cross Attention (DDCA) to progressively process global information accumulated over time. 2) It integrates a multi-branch Mix Fusion Feed Forward Network (MixF3N) to enhance the precise spatial flow of local features. To facilitate training and evaluation, we derive the KITTI360 dataset with various FoV masks, which covers both outer- and inner-FoV expansion scenarios. We also conduct both quantitative assessments and qualitative comparisons of beyond-FoV semantics and beyond-FoV object detection across different models. We illustrate that employing FlowLens to reconstruct unseen scenes even enhances perception within the field of view by providing reliable semantic context. Extensive experiments and user studies involving offline and online video inpainting, as well as beyond-FoV perception tasks, demonstrate that FlowLens achieves state-of-the-art performance. The source code and dataset are made publicly available at https://github.com/MasterHow/FlowLens.
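To make the two propagation mechanisms named in the abstract concrete, the sketch below illustrates them in PyTorch: explicit propagation by backward-warping past features with optical flow, and implicit propagation through a clip-recurrent cross-attention hub that attends to a rolling cache of past-clip tokens. This is a minimal illustration under assumed shapes and module names (flow_warp, ClipRecurrentHub, memory_len), not the authors' released implementation; the actual FlowLens code is in the linked repository.

```python
# Minimal sketch (not the authors' code) of flow-guided feature propagation plus a
# clip-recurrent cross-attention hub; names, shapes, and the memory policy are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def flow_warp(feat, flow):
    """Backward-warp past features with a dense optical flow field.

    feat: (B, C, H, W) features from a past frame.
    flow: (B, 2, H, W) flow (dx, dy) mapping current-frame pixels to past-frame pixels.
    """
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow       # (B, 2, H, W)
    # Normalize sampling locations to [-1, 1] for grid_sample.
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                  # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)


class ClipRecurrentHub(nn.Module):
    """Cross-attention from current-clip tokens to a rolling cache of past-clip tokens."""

    def __init__(self, dim=256, heads=8, memory_len=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.memory_len = memory_len
        self.memory = []  # list of (B, N, dim) token tensors from past clips

    def forward(self, tokens):
        # Attend current tokens (queries) to accumulated past tokens (keys/values).
        if self.memory:
            past = torch.cat(self.memory, dim=1)
            out, _ = self.attn(tokens, past, past)
            tokens = self.norm(tokens + out)
        # Cache the current clip's tokens so future clips can query them (online setting).
        self.memory.append(tokens.detach())
        self.memory = self.memory[-self.memory_len:]
        return tokens


if __name__ == "__main__":
    feat = torch.randn(1, 256, 32, 48)          # past-frame features
    flow = torch.randn(1, 2, 32, 48)            # dense flow field
    warped = flow_warp(feat, flow)              # explicit, flow-guided propagation
    hub = ClipRecurrentHub(dim=256)
    tokens = warped.flatten(2).transpose(1, 2)  # (B, H*W, C) tokens
    out = hub(tokens)                           # implicit clip-recurrent propagation
    print(warped.shape, out.shape)
```

In an online setting, the cached tokens let each new clip query information accumulated over time, which is the spirit of the Clip-Recurrent Hub described in the abstract; the real model additionally decouples attention across the 3D dimensions (DDCA) and refines local features with MixF3N.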
Related papers
- Radiance Field Learners As UAV First-Person Viewers [36.59524833437512]
First-Person-View (FPV) holds immense potential for revolutionizing the trajectory of Unmanned Aerial Vehicles (UAVs).
Traditional Neural Radiance Field (NeRF) methods face challenges such as sampling single points per granularity.
We introduce FPV-NeRF, addressing these challenges through three key facets.
arXiv Detail & Related papers (2024-08-10T12:29:11Z) - Let Occ Flow: Self-Supervised 3D Occupancy Flow Prediction [14.866463843514156]
Let Occ Flow is the first self-supervised work for joint 3D occupancy and occupancy flow prediction using only camera inputs.
Our approach incorporates a novel attention-based temporal fusion module to capture dynamic object dependencies.
Our method extends differentiable rendering to 3D volumetric flow fields.
arXiv Detail & Related papers (2024-07-10T12:20:11Z) - E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning [53.63364311738552]
Bio-inspired event cameras or dynamic vision sensors are capable of capturing per-pixel brightness changes (called event-streams) in high temporal resolution and high dynamic range.
It calls for events-to-video (E2V) solutions which take event-streams as input and generate high quality video frames for intuitive visualization.
We propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events.
arXiv Detail & Related papers (2024-01-16T05:10:50Z) - TransFlow: Transformer as Flow Learner [22.727953339383344]
We propose TransFlow, a pure transformer architecture for optical flow estimation.
It provides more accurate correlation and trustworthy matching in flow estimation.
It recovers more compromised information in flow estimation through long-range temporal association in dynamic scenes.
arXiv Detail & Related papers (2023-04-23T03:11:23Z) - Monocular BEV Perception of Road Scenes via Front-to-Top View Projection [57.19891435386843]
We present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird's-eye view.
Our model runs at 25 FPS on a single GPU, which is efficient and applicable for real-time panorama HD map reconstruction.
arXiv Detail & Related papers (2022-11-15T13:52:41Z) - FoV-Net: Field-of-View Extrapolation Using Self-Attention and Uncertainty [95.11806655550315]
We utilize information from a video sequence with a narrow field-of-view to infer the scene at a wider field-of-view.
We propose a temporally consistent field-of-view extrapolation framework, namely FoV-Net.
Experiments show that FoV-Net extrapolates the temporally consistent wide field-of-view scene better than existing alternatives.
arXiv Detail & Related papers (2022-04-04T06:24:03Z) - Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
Flow-Guided Sparse Transformer (FGST) is a framework for video deblurring.
FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames.
Our proposed FGST outperforms state-of-the-art methods on both DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring.
arXiv Detail & Related papers (2022-01-06T02:05:32Z) - Optical Flow Estimation from a Single Motion-blurred Image [66.2061278123057]
Motion blur in an image can be of practical interest for fundamental computer vision problems.
We propose a novel framework to estimate optical flow from a single motion-blurred image in an end-to-end manner.
arXiv Detail & Related papers (2021-03-04T12:45:18Z) - Hierarchical Attention Learning of Scene Flow in 3D Point Clouds [28.59260783047209]
This paper studies the problem of scene flow estimation from two consecutive 3D point clouds.
A novel hierarchical neural network with double attention is proposed for learning the correlation of point features in adjacent frames.
Experiments show that the proposed network outperforms the state-of-the-art performance of 3D scene flow estimation.
arXiv Detail & Related papers (2020-10-12T14:56:08Z) - Knowledge Fusion Transformers for Video Action Recognition [0.0]
We present a self-attention based feature enhancer to fuse action knowledge in the 3D-based context of the video clip to be classified.
We show how using only one stream with little or no pretraining can pave the way to performance close to the current state-of-the-art.
arXiv Detail & Related papers (2020-09-29T05:13:45Z) - Feature Flow: In-network Feature Flow Estimation for Video Object Detection [56.80974623192569]
Optical flow is widely used in computer vision tasks to provide pixel-level motion information.
A common approach is to forward optical flow to a neural network and fine-tune this network on the task dataset.
We propose a novel network (IFF-Net) with an In-network Feature Flow estimation module for video object detection.
arXiv Detail & Related papers (2020-09-21T07:55:50Z)