Understanding Road Layout from Videos as a Whole
- URL: http://arxiv.org/abs/2007.00822v1
- Date: Thu, 2 Jul 2020 00:59:15 GMT
- Title: Understanding Road Layout from Videos as a Whole
- Authors: Buyu Liu, Bingbing Zhuang, Samuel Schulter, Pan Ji, Manmohan
Chandraker
- Abstract summary: We formulate it as a top-view road attributes prediction problem and our goal is to predict these attributes for each frame both accurately and consistently.
We exploit the following three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
- Score: 82.30800791500869
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we address the problem of inferring the layout of complex road
scenes from video sequences. To this end, we formulate it as a top-view road
attributes prediction problem and our goal is to predict these attributes for
each frame both accurately and consistently. In contrast to prior work, we
exploit the following three novel aspects: leveraging camera motions in videos,
including context cues, and incorporating long-term video information.
Specifically, we introduce a model that aims to enforce prediction consistency
in videos. Our model consists of one LSTM and one Feature Transform Module
(FTM). The former implicitly incorporates the consistency constraint with its
hidden states, and the latter explicitly takes the camera motion into
consideration when aggregating information along videos. Moreover, we propose
to incorporate context information by introducing road participants, e.g.
objects, into our model. When the entire video sequence is available, our model
is also able to encode both local and global cues, e.g. information from both
past and future frames. Experiments on two data sets show that: (1)
Incorporating either global or contextual cues improves the prediction accuracy
and leveraging both gives the best performance. (2) Introducing the LSTM and
FTM modules improves the prediction consistency in videos. (3) The proposed
method outperforms the SOTA by a large margin.
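For readers who want a concrete picture of how the LSTM and the Feature Transform Module (FTM) described above might fit together, a minimal PyTorch-style sketch follows. This is not the authors' released code: the module names, tensor shapes, the affine-warp realization of the camera-motion alignment, and the attribute count are assumptions made purely for illustration.

```python
# Minimal sketch (not the authors' code): per-frame top-view features are
# (1) warped into the current frame using the relative camera motion (FTM),
# (2) fused with the current features, and (3) fed to an LSTM recurrence whose
# hidden state carries the implicit temporal-consistency constraint.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureTransformModule(nn.Module):
    """Warps the previous frame's feature map into the current top view using
    a 2x3 affine motion (an assumed stand-in for the estimated camera motion)."""
    def forward(self, prev_feat, motion_2x3):
        # motion_2x3: (B, 2, 3) affine parameters of the top-view ego-motion
        grid = F.affine_grid(motion_2x3, prev_feat.shape, align_corners=False)
        return F.grid_sample(prev_feat, grid, align_corners=False)

class RoadLayoutRNN(nn.Module):
    def __init__(self, feat_dim=128, num_attrs=14):  # num_attrs is illustrative
        super().__init__()
        self.ftm = FeatureTransformModule()
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=3, padding=1)
        self.rnn = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, num_attrs)  # per-frame road attributes

    def forward(self, feats, motions):
        # feats: (B, T, C, H, W) per-frame top-view features from a CNN backbone
        # motions: (B, T, 2, 3) relative camera motion between consecutive frames
        B, T, C, H, W = feats.shape
        fused = [feats[:, 0]]
        for t in range(1, T):
            warped = self.ftm(fused[-1], motions[:, t])        # explicit consistency
            fused.append(self.fuse(torch.cat([feats[:, t], warped], dim=1)))
        seq = torch.stack([f.mean(dim=(2, 3)) for f in fused], dim=1)  # (B, T, C)
        hidden, _ = self.rnn(seq)                              # implicit consistency
        return self.head(hidden)                               # (B, T, num_attrs)
```

When the entire sequence is available offline, the same recurrence could plausibly be run in both directions (e.g. `nn.LSTM(..., bidirectional=True)`) to capture the past-and-future cues the abstract mentions; that choice is likewise an assumption, not a claim about the paper's exact architecture.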
Related papers
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z) - Semantic Segmentation on VSPW Dataset through Masked Video Consistency [19.851665554201407]
We present our solution for the PVUW competition, where we introduce masked video consistency (MVC) based on existing models.
MVC enforces consistency between predictions of masked random frames where patches are withheld.
Our approach achieves 67% mIoU performance on the VSPW dataset, ranking 2nd in the PVUW2024 VSS track.
arXiv Detail & Related papers (2024-06-07T14:41:24Z) - Transform-Equivariant Consistency Learning for Temporal Sentence
Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - STOA-VLP: Spatial-Temporal Modeling of Object and Action for
Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z) - Optimizing Video Prediction via Video Frame Interpolation [53.16726447796844]
We present a new optimization framework for video prediction via video frame interpolation, inspired by the photo-realistic results of video frame interpolation.
Our framework is based on optimization with a pretrained differentiable video frame interpolation module, without the need for a training dataset.
Our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information.
arXiv Detail & Related papers (2022-06-27T17:03:46Z) - Mutual Information Based Method for Unsupervised Disentanglement of
Video Representation [0.0]
Video prediction models have found prospective applications in Maneuver Planning, Health care, Autonomous Navigation and Simulation.
One of the major challenges in future frame generation is the high dimensional nature of visual data.
We propose the Mutual Information Predictive Auto-Encoder framework, which simplifies the task of predicting high dimensional video frames.
arXiv Detail & Related papers (2020-11-17T13:16:07Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and gates that pass the more relevant information through.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z) - HERO: Hierarchical Encoder for Video+Language Omni-representation
Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)