Spectrum-guided Multi-granularity Referring Video Object Segmentation
- URL: http://arxiv.org/abs/2307.13537v1
- Date: Tue, 25 Jul 2023 14:35:25 GMT
- Title: Spectrum-guided Multi-granularity Referring Video Object Segmentation
- Authors: Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Ajmal Mian
- Abstract summary: Current referring video object segmentation (R-VOS) techniques extract conditional kernels from encoded (low-resolution) vision-language features to segment the decoded high-resolution features.
This causes significant feature drift, which the segmentation kernels struggle to perceive during the forward computation.
We propose a Spectrum-guided Multi-granularity approach, which performs direct segmentation on the encoded features and employs visual details to further optimize the masks.
- Score: 56.95836951559529
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Current referring video object segmentation (R-VOS) techniques extract
conditional kernels from encoded (low-resolution) vision-language features to
segment the decoded high-resolution features. We discovered that this causes
significant feature drift, which the segmentation kernels struggle to perceive
during the forward computation. This degrades the segmentation ability of the
kernels. To address the drift problem, we propose a
Spectrum-guided Multi-granularity (SgMg) approach, which performs direct
segmentation on the encoded features and employs visual details to further
optimize the masks. In addition, we propose Spectrum-guided Cross-modal Fusion
(SCF) to perform intra-frame global interactions in the spectral domain for
effective multimodal representation. Finally, we extend SgMg to perform
multi-object R-VOS, a new paradigm that enables simultaneous segmentation of
multiple referred objects in a video. This not only makes R-VOS faster, but
also more practical. Extensive experiments show that SgMg achieves
state-of-the-art performance on four video benchmark datasets, outperforming
the nearest competitor by 2.8 percentage points on Ref-YouTube-VOS. Our extended
SgMg enables multi-object R-VOS and runs about 3 times faster while maintaining
satisfactory performance. Code is available at https://github.com/bo-miao/SgMg.
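For intuition only, below is a minimal PyTorch-style sketch of the kind of spectral-domain operation the abstract describes (the same spectral-filtering idea also appears in the Global Spectral Filter Memory entry further down): visual features are gated by a language embedding and then filtered globally in the 2D Fourier domain, so every spatial location interacts with every other one in a single step. This is an assumption-laden illustration, not the official SgMg/SCF implementation; the class name, the gating scheme, and the filter parameterization are all hypothetical.
```python
# Illustrative sketch of spectral-domain cross-modal fusion (not the official SgMg code).
import torch
import torch.nn as nn


class SpectralFusionSketch(nn.Module):
    """Gates visual features with a language embedding, then applies a
    learnable filter in the 2D Fourier domain so that every spatial
    location interacts with every other one in a single step."""

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # Project the sentence-level language embedding to the visual channel dim.
        self.lang_proj = nn.Linear(channels, channels)
        # One complex-valued weight per channel and rFFT frequency bin (hypothetical parameterization).
        self.freq_filter = nn.Parameter(
            0.02 * torch.randn(channels, height, width // 2 + 1, dtype=torch.cfloat)
        )

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis: (B, C, H, W) visual features; lang: (B, C) language embedding.
        gate = torch.sigmoid(self.lang_proj(lang))[:, :, None, None]
        fused = vis * gate                                # simple cross-modal gating
        spec = torch.fft.rfft2(fused, norm="ortho")       # to the spectral domain
        spec = spec * self.freq_filter                    # global (all-to-all) interaction
        return torch.fft.irfft2(spec, s=fused.shape[-2:], norm="ortho")


if __name__ == "__main__":
    block = SpectralFusionSketch(channels=256, height=20, width=20)
    out = block(torch.randn(2, 256, 20, 20), torch.randn(2, 256))
    print(out.shape)  # torch.Size([2, 256, 20, 20])
```
Because multiplication in the frequency domain corresponds to a global circular convolution in the spatial domain, a single element-wise filter gives intra-frame interactions with a global receptive field at low cost.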
Related papers
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z) - ESDMR-Net: A Lightweight Network With Expand-Squeeze and Dual Multiscale Residual Connections for Medical Image Segmentation [7.921517156237902]
This paper presents an expand-squeeze dual multiscale residual network (ESDMR-Net).
It is a fully convolutional network that is well-suited for resource-constrained computing hardware such as mobile devices.
We present experiments on seven datasets from five distinct applications.
arXiv Detail & Related papers (2023-12-17T02:15:49Z) - Global Spectral Filter Memory Network for Video Object Segmentation [33.42697528492191]
This paper studies semi-supervised video object segmentation through boosting intra-frame interaction.
We propose Global Spectral Filter Memory network (GSFM), which improves intra-frame interaction through learning long-term spatial dependencies in the spectral domain.
arXiv Detail & Related papers (2022-10-11T16:02:02Z) - Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred to by a given language expression.
Existing works typically require the compressed video bitstream to be decoded into RGB frames before segmentation.
This may hamper their application in real-world, resource-limited scenarios such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z) - Region Aware Video Object Segmentation with Deep Motion Modeling [56.95836951559529]
Region Aware Video Object Segmentation (RAVOS) is a method that predicts regions of interest (ROIs) for efficient object segmentation and memory storage.
For efficient segmentation, object features are extracted according to the ROIs, and an object decoder is designed for object-level segmentation.
For efficient memory storage, we propose motion path memory to filter out redundant context by memorizing the features within the motion path of objects between two frames.
arXiv Detail & Related papers (2022-07-21T01:44:40Z) - FAMINet: Learning Real-time Semi-supervised Video Object Segmentation with Steepest Optimized Optical Flow [21.45623125216448]
Semi-supervised video object segmentation (VOS) aims to segment a few moving objects in a video sequence, where these objects are specified by the annotation of the first frame.
Optical flow has been used in many existing semi-supervised VOS methods to improve segmentation accuracy.
FAMINet, which consists of a feature extraction network (F), an appearance network (A), a motion network (M), and an integration network (I), is proposed in this study to address the above-mentioned problem.
arXiv Detail & Related papers (2021-11-20T07:24:33Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
Full-duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion and decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z) - Efficient Video Object Segmentation with Compressed Video [36.192735485675286]
We propose an efficient framework for semi-supervised video object segmentation by exploiting the temporal redundancy of the video.
Our method performs inference on selected keyframes and makes predictions for other frames via propagation based on motion vectors and residuals from the compressed video bitstream (a minimal sketch of this propagation step follows the related-papers list).
Using STM with top-k filtering as our base model, we achieve highly competitive results on DAVIS16 and YouTube-VOS, with substantial speedups of up to 4.9X and little loss in accuracy.
arXiv Detail & Related papers (2021-07-26T12:57:04Z) - Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification [110.52328716130022]
Video-based person re-identification (re-ID) is an important research topic in computer vision.
We propose a novel graph-based framework, namely Multi-Granular Hypergraph (MGH), to achieve better representational capabilities.
MGH achieves 90.0% top-1 accuracy on MARS, outperforming state-of-the-art schemes.
arXiv Detail & Related papers (2021-04-30T11:20:02Z)
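As referenced in the compressed-video entry above (Efficient Video Object Segmentation with Compressed Video), keyframe predictions can be propagated to other frames using motion information from the bitstream. Below is a minimal, hypothetical sketch of that propagation step, assuming per-pixel motion vectors have already been decoded and upsampled; the function name and tensor layout are illustrative and not the paper's actual code.
```python
# Hypothetical sketch: backward-warping a keyframe mask with motion vectors via grid sampling.
import torch
import torch.nn.functional as F


def propagate_mask(mask: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
    """Backward-warp a keyframe mask to the current frame.

    mask:   (B, 1, H, W) soft mask predicted on the keyframe.
    motion: (B, 2, H, W) per-pixel displacement in pixels (dx, dy) pointing from
            each current-frame location to its source in the keyframe, e.g.
            motion vectors decoded from the bitstream and upsampled to pixels.
    """
    b, _, h, w = mask.shape
    # Identity sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=mask.device),
        torch.linspace(-1.0, 1.0, w, device=mask.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Convert pixel displacements to the normalized coordinate range.
    offset = torch.stack(
        (motion[:, 0] * 2.0 / max(w - 1, 1), motion[:, 1] * 2.0 / max(h - 1, 1)),
        dim=-1,
    )
    # Sample the keyframe mask at the displaced locations.
    return F.grid_sample(mask, base + offset, mode="bilinear", align_corners=True)


if __name__ == "__main__":
    warped = propagate_mask(torch.rand(1, 1, 64, 64), torch.zeros(1, 2, 64, 64))
    print(warped.shape)  # torch.Size([1, 1, 64, 64])
```
With zero motion the warp reduces to an identity sampling, so the keyframe mask is returned unchanged; in practice the warped mask would be refined using the residual signal on non-keyframes.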