MUSTAN: Multi-scale Temporal Context as Attention for Robust Video
Foreground Segmentation
- URL: http://arxiv.org/abs/2402.00918v1
- Date: Thu, 1 Feb 2024 13:47:23 GMT
- Title: MUSTAN: Multi-scale Temporal Context as Attention for Robust Video
Foreground Segmentation
- Authors: Praveen Kumar Pokala, Jaya Sai Kiran Patibandla, Naveen Kumar Pandey,
and Balakrishna Reddy Pailla
- Abstract summary: Video foreground segmentation (VFS) is an important computer vision task wherein one aims to segment the objects under motion from the background.
Most of the current methods are image-based, i.e., rely only on spatial cues while ignoring motion cues.
In this paper, we utilize the temporal information and the spatial cues from the video data to improve OOD performance.
- Score: 2.2232550112727267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video foreground segmentation (VFS) is an important computer vision task
wherein one aims to segment the objects under motion from the background. Most
of the current methods are image-based, i.e., rely only on spatial cues while
ignoring motion cues. Therefore, they tend to overfit the training data and do
not generalize well to out-of-domain (OOD) distributions. To address this,
prior works have exploited cues such as optical flow and background
subtraction masks. However, obtaining video data with annotations such as
optical flow is challenging. In this paper, we utilize the temporal
information and the spatial cues from the video data to improve OOD
performance. The challenge lies in modeling the temporal information of the
video data in an interpretable way, and how this is done makes a very
noticeable difference. We therefore devise a strategy that integrates the
temporal context of the video into the development of VFS. Our approach gives
rise to two deep learning architectures, MUSTAN1 and MUSTAN2, which are based
on the idea of multi-scale temporal context as attention, i.e., the temporal
context helps our models learn better representations that are beneficial for
VFS. Further, we introduce a new video dataset for VFS, the Indoor
Surveillance Dataset (ISD). It has frame-level annotations such as foreground
binary masks, depth maps, and instance-level semantic annotations, so ISD can
also benefit other computer vision tasks. We validate the efficacy of our
architectures and compare their performance with baselines. We demonstrate
that the proposed methods significantly outperform the benchmark methods on
OOD data. In addition, the performance of MUSTAN2 improves significantly on
certain OOD video categories thanks to ISD.
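The abstract does not spell out how the multi-scale temporal context is turned into attention, so the following is only a minimal PyTorch sketch of one plausible reading: per-frame encoder features are average-pooled over several temporal window sizes, projected, fused, and used as a sigmoid gate over the current frame's features. The module name `MultiScaleTemporalAttention`, the `scales` parameter, and the pooling/gating choices are illustrative assumptions, not the MUSTAN1/MUSTAN2 implementation.
```python
import torch
import torch.nn as nn


class MultiScaleTemporalAttention(nn.Module):
    """Illustrative sketch (not the paper's module): pool per-frame features
    over several temporal window sizes and turn the pooled context into a
    gate (attention map) over the current frame's features."""

    def __init__(self, channels, scales=(2, 4, 8)):
        super().__init__()
        self.scales = scales
        # One 1x1 projection per temporal scale.
        self.context_proj = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in scales]
        )
        # Fuse the multi-scale context into a single attention tensor.
        self.fuse = nn.Conv2d(channels * len(scales), channels, kernel_size=1)

    def forward(self, feats):
        # feats: (B, T, C, H, W) per-frame encoder features; the last frame
        # is the one being segmented.
        b, t, c, h, w = feats.shape
        current = feats[:, -1]                            # (B, C, H, W)
        contexts = []
        for scale, proj in zip(self.scales, self.context_proj):
            window = feats[:, -min(scale, t):]            # last `scale` frames
            contexts.append(proj(window.mean(dim=1)))     # temporal avg pool
        context = self.fuse(torch.cat(contexts, dim=1))   # (B, C, H, W)
        attention = torch.sigmoid(context)                # gate in (0, 1)
        return current * attention                        # attended features


if __name__ == "__main__":
    module = MultiScaleTemporalAttention(channels=64)
    clip_feats = torch.randn(2, 8, 64, 32, 32)  # 2 clips, 8 frames each
    print(module(clip_feats).shape)             # torch.Size([2, 64, 32, 32])
```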
Related papers
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-awareness video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Event-Free Moving Object Segmentation from Moving Ego Vehicle [88.33470650615162]
Moving object segmentation (MOS) in dynamic scenes is an important, challenging, but under-explored research topic for autonomous driving.
Most segmentation methods leverage motion cues obtained from optical flow maps.
We propose to exploit event cameras for better video understanding, which provide rich motion cues without relying on optical flow.
arXiv Detail & Related papers (2023-04-28T23:43:10Z)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
We propose the Motion-Contrastive Perception Network (MCPNet), which consists of two branches: Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is only trained on a pair of original and processed videos directly instead of a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- SSAN: Separable Self-Attention Network for Video Representation Learning [11.542048296046524]
We propose a separable self-attention (SSA) module, which models spatial and temporal correlations sequentially (see the sketch after this list).
By adding the SSA module to a 2D CNN, we build an SSA network (SSAN) for video representation learning.
Our approach outperforms state-of-the-art methods on Something-Something and Kinetics-400 datasets.
arXiv Detail & Related papers (2021-05-27T10:02:04Z)
- Learning by Aligning Videos in Time [10.075645944474287]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
arXiv Detail & Related papers (2021-03-31T17:55:52Z)
- Straight to the Point: Fast-forwarding Videos via Reinforcement Learning Using Textual Data [1.004766879203303]
We present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos.
Our approach can adaptively skip frames that are not relevant to conveying the information, without creating gaps in the final video.
We propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space.
arXiv Detail & Related papers (2020-03-31T14:07:45Z)
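As referenced in the SSAN entry above, here is a rough sketch of the separable (spatial-then-temporal) self-attention pattern, built from the stock torch.nn.MultiheadAttention layer. It only illustrates the general idea of factorizing attention over space and then over time; the tensor layout, residual connections, and head count are assumptions, not the authors' SSA module.
```python
import torch
import torch.nn as nn


class SeparableSelfAttention(nn.Module):
    """Sketch of separable self-attention: tokens attend within each frame
    first (spatial), then each spatial position attends across frames
    (temporal)."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, C) -- batch, frames, spatial tokens per frame, channels
        b, t, n, c = x.shape

        # 1) Spatial attention: tokens of the same frame attend to each other.
        xs = x.reshape(b * t, n, c)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, n, c)            # residual connection

        # 2) Temporal attention: the same position attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(b, n, t, c).permute(0, 2, 1, 3)
        return x


if __name__ == "__main__":
    layer = SeparableSelfAttention(dim=64)
    tokens = torch.randn(2, 8, 49, 64)   # 2 clips, 8 frames, 7x7 tokens, 64-dim
    print(layer(tokens).shape)           # torch.Size([2, 8, 49, 64])
```
Factorizing attention this way costs roughly O(T*N^2 + N*T^2) instead of O((T*N)^2) for joint space-time attention, which is what makes such a module cheap enough to drop into a 2D CNN backbone.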