Adaptive Multi-source Predictor for Zero-shot Video Object Segmentation
- URL: http://arxiv.org/abs/2303.10383v2
- Date: Sat, 3 Feb 2024 10:04:44 GMT
- Title: Adaptive Multi-source Predictor for Zero-shot Video Object Segmentation
- Authors: Xiaoqi Zhao, Shijie Chang, Youwei Pang, Jiaxing Yang, Lihe Zhang,
Huchuan Lu
- Abstract summary: We propose a novel adaptive multi-source predictor for zero-shot video object segmentation (ZVOS).
In the static object predictor, the RGB source is simultaneously converted into depth and static saliency sources.
Experiments show that the proposed model outperforms the state-of-the-art methods on three challenging ZVOS benchmarks.
- Score: 68.56443382421878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Static and moving objects often occur in real-life videos. Most video object
segmentation methods only focus on extracting and exploiting motion cues to
perceive moving objects. When faced with frames of static objects, moving
object predictors may produce failed results because of uncertain motion
information, such as low-quality optical flow maps. Moreover, different sources
such as RGB, depth, optical flow, and static saliency can provide useful
information about the objects. However, existing approaches consider only RGB,
or RGB together with optical flow. In this paper, we propose a novel
adaptive multi-source predictor for zero-shot video object segmentation (ZVOS).
In the static object predictor, the RGB source is simultaneously converted
into depth and static saliency sources. In the moving object predictor, we propose
the multi-source fusion structure. First, the spatial importance of each source
is highlighted with the help of the interoceptive spatial attention module
(ISAM). Second, the motion-enhanced module (MEM) is designed to generate pure
foreground motion attention for improving the representation of static and
moving features in the decoder. Furthermore, we design a feature purification
module (FPM) to filter out inter-source incompatible features. Together, ISAM,
MEM, and FPM enable effective fusion of the multi-source features. In
addition, we put forward an adaptive predictor fusion network (APF) to evaluate
the quality of the optical flow map and fuse the predictions from the static
object predictor and the moving object predictor, preventing over-reliance on
failed predictions caused by low-quality optical flow maps.
Experiments show that the proposed model outperforms the state-of-the-art
methods on three challenging ZVOS benchmarks. In addition, the static object
predictor simultaneously produces high-quality depth and static saliency maps.
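To make the fusion pipeline concrete, below is a minimal PyTorch sketch of two of the ideas above: an ISAM-style module that reweights each source's features with a learned spatial importance map, and an APF-style head that scores optical-flow quality and blends the static and moving predictions. All module names, layer sizes, and the quality head are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only; layer sizes and heads are assumptions,
# not the paper's released code.
import torch
import torch.nn as nn

class InteroceptiveSpatialAttention(nn.Module):
    """Weights each source's features with a learned spatial importance
    map, the role the abstract assigns to ISAM."""
    def __init__(self, channels: int, num_sources: int):
        super().__init__()
        self.score = nn.Conv2d(channels * num_sources, num_sources, kernel_size=1)

    def forward(self, feats):
        # feats: list of per-source feature maps, each (B, C, H, W)
        weights = torch.softmax(self.score(torch.cat(feats, dim=1)), dim=1)
        return sum(weights[:, i:i + 1] * f for i, f in enumerate(feats))

class AdaptivePredictorFusion(nn.Module):
    """Scores optical-flow quality and blends the static and moving
    predictions, the role the abstract assigns to APF."""
    def __init__(self):
        super().__init__()
        self.quality_head = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, 1),
            nn.Sigmoid(),
        )

    def forward(self, flow_rgb, static_mask, moving_mask):
        # alpha -> 1 when the flow map looks reliable, -> 0 otherwise,
        # so low-quality flow shifts weight to the static predictor.
        alpha = self.quality_head(flow_rgb).view(-1, 1, 1, 1)
        return alpha * moving_mask + (1.0 - alpha) * static_mask
```

Under this reading, a degenerate flow map drives alpha toward zero, so the final mask falls back on the static object predictor instead of over-relying on failed motion predictions.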
Related papers
- Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring [71.60457491155451]
Eliminating image blur produced by various kinds of motion has been a challenging problem.
We propose a novel real-world deblurring filtering model called the Motion-adaptive Separable Collaborative Filter.
Our method provides an effective solution for real-world motion blur removal and achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-04-19T19:44:24Z)
- Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation [17.71871884366252]
Video object segmentation (VOS) aims to detect the most salient object in a video without external guidance about the object.
Recent methods collaboratively use motion cues extracted from optical flow maps with appearance cues extracted from RGB images.
We propose a novel motion-as-option network that treats motion cues as optional (see the sketch after this entry).
arXiv Detail & Related papers (2023-09-26T09:34:13Z)
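The motion-as-option idea admits a short sketch. In the hedged reconstruction below, the motion branch is simply allowed to consume the RGB frame whenever a flow map is unavailable or, during training, randomly dropped, so the network never becomes dependent on flow; the encoders, head, and drop probability are placeholder assumptions rather than the paper's exact design.

```python
# Hedged sketch of the motion-as-option idea: the motion branch sometimes
# receives the RGB frame instead of the optical flow map, so the network
# never becomes dependent on flow being available or reliable.
import random
import torch.nn as nn

class MotionAsOptionNet(nn.Module):
    def __init__(self, appearance_enc: nn.Module, motion_enc: nn.Module,
                 seg_head: nn.Module, p_drop_motion: float = 0.5):
        super().__init__()
        self.appearance_enc = appearance_enc  # placeholder encoder
        self.motion_enc = motion_enc          # placeholder encoder
        self.seg_head = seg_head              # placeholder decoder/head
        self.p_drop_motion = p_drop_motion

    def forward(self, rgb, flow=None):
        # Train time: randomly substitute RGB for flow in the motion branch.
        # Test time: fall back to RGB whenever no flow map is supplied.
        use_rgb = flow is None or (self.training and
                                   random.random() < self.p_drop_motion)
        motion_input = rgb if use_rgb else flow
        # Both encoders are assumed to emit same-shaped feature maps.
        return self.seg_head(self.appearance_enc(rgb) +
                             self.motion_enc(motion_input))
```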
- MV-ROPE: Multi-view Constraints for Robust Category-level Object Pose and Size Estimation [23.615122326731115]
We propose a novel solution that makes use of RGB video streams.
Our framework consists of three modules: a scale-aware monocular dense SLAM solution, a lightweight object pose predictor, and an object-level pose graph.
Our experimental results demonstrate that when utilizing public dataset sequences with high-quality depth information, the proposed method exhibits comparable performance to state-of-the-art RGB-D methods.
arXiv Detail & Related papers (2023-08-17T08:29:54Z)
- FOLT: Fast Multiple Object Tracking from UAV-captured Videos Based on Optical Flow [27.621524657473945]
Multiple object tracking (MOT) has been successfully investigated in computer vision.
However, MOT for the videos captured by unmanned aerial vehicles (UAV) is still challenging due to small object size, blurred object appearance, and very large and/or irregular motion.
We propose FOLT to mitigate these problems and reach fast and accurate MOT in UAV view.
arXiv Detail & Related papers (2023-08-14T15:24:44Z)
- DORT: Modeling Dynamic Objects in Recurrent for Multi-Camera 3D Object Detection and Tracking [67.34803048690428]
We propose to model Dynamic Objects in RecurrenT (DORT) to tackle this problem.
DORT extracts object-wise local volumes for motion estimation that also alleviates the heavy computational burden.
It is flexible and practical that can be plugged into most camera-based 3D object detectors.
arXiv Detail & Related papers (2023-03-29T12:33:55Z)
- Motion-aware Memory Network for Fast Video Salient Object Detection [15.967509480432266]
We design a space-time memory (STM)-based network whose temporal branch extracts useful temporal information for the current frame from adjacent frames (a minimal STM read is sketched after this entry).
In the encoding stage, we generate high-level temporal features by using high-level features from the current and its adjacent frames.
In the decoding stage, we propose an effective fusion strategy for spatial and temporal branches.
The proposed model does not require optical flow or other preprocessing, and can reach a speed of nearly 100 FPS during inference.
arXiv Detail & Related papers (2022-08-01T15:56:19Z)
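The space-time memory (STM) read attributed to the temporal branch above can be sketched in a few lines: the current frame's query features attend over key/value features stored from adjacent frames. The plain dot-product attention and these tensor shapes are simplifying assumptions, not the paper's exact design.

```python
# Compact sketch of a space-time memory (STM) read: the current frame's
# query attends over key/value features of adjacent (memory) frames.
import torch

def stm_read(query, mem_keys, mem_values):
    """query: (B, Ck, H, W); mem_keys: (B, Ck, T, H, W);
    mem_values: (B, Cv, T, H, W) -> read: (B, Cv, H, W)."""
    B, Ck, H, W = query.shape
    Cv = mem_values.shape[1]
    q = query.flatten(2)       # (B, Ck, H*W)
    k = mem_keys.flatten(2)    # (B, Ck, T*H*W)
    v = mem_values.flatten(2)  # (B, Cv, T*H*W)
    # Similarity of every memory location to every query location,
    # normalized over the memory axis.
    attn = torch.softmax(k.transpose(1, 2) @ q / Ck ** 0.5, dim=1)  # (B, T*H*W, H*W)
    read = v @ attn            # (B, Cv, H*W)
    return read.view(B, Cv, H, W)
```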
- Hierarchical Feature Alignment Network for Unsupervised Video Object Segmentation [99.70336991366403]
We propose a concise, practical, and efficient architecture for appearance and motion feature alignment.
The proposed HFAN reaches a new state-of-the-art performance on DAVIS-16, achieving 88.7 $\mathcal{J}\&\mathcal{F}$ Mean, i.e., a relative improvement of 3.5% over the best published result.
arXiv Detail & Related papers (2022-07-18T10:10:14Z)
- Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation [86.94578023985677]
We propose a novel multi-source fusion network for zero-shot video object segmentation.
The proposed model achieves compelling performance against state-of-the-art methods.
arXiv Detail & Related papers (2021-08-11T07:37:44Z)
- DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance over state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)