Multi-Attention Network for Compressed Video Referring Object
Segmentation
- URL: http://arxiv.org/abs/2207.12622v1
- Date: Tue, 26 Jul 2022 03:00:52 GMT
- Title: Multi-Attention Network for Compressed Video Referring Object
Segmentation
- Authors: Weidong Chen, Dexiang Hong, Yuankai Qi, Zhenjun Han, Shuhui Wang,
Laiyun Qing, Qingming Huang and Guorong Li
- Abstract summary: Referring video object segmentation aims to segment the object referred to by a given language expression.
Existing works typically require the compressed video bitstream to be decoded into RGB frames before segmentation.
This may hamper their application in real-world, resource-constrained scenarios such as autonomous cars and drones.
- Score: 103.18477550023513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video object segmentation aims to segment the object
referred to by a given language expression. Existing works typically require
the compressed video bitstream to be decoded into RGB frames before
segmentation, which increases computation and storage requirements and
ultimately slows down inference. This may hamper their application in
real-world, resource-constrained scenarios such as autonomous cars and drones.
To alleviate this problem, we explore the referring object segmentation task
directly on compressed videos, i.e., on the original video data stream. Besides
the inherent difficulty of referring video object segmentation itself,
obtaining discriminative representations from compressed video is also rather
challenging. To address this problem, we propose a multi-attention network that
consists of a dual-path dual-attention module and a query-based cross-modal
Transformer module. Specifically, the dual-path dual-attention module is
designed to extract effective representations from compressed data in its three
modalities, i.e., I-frames, motion vectors, and residuals. The query-based
cross-modal Transformer first models the correlation between the linguistic and
visual modalities, and the fused multi-modal features then guide object queries
to generate a content-aware dynamic kernel and to predict the final
segmentation masks. Different from previous works, we propose to learn just one
kernel, which removes the complicated mask-matching post-processing of existing
methods. Extensive experimental results on three challenging datasets show the
effectiveness of our method compared against several state-of-the-art methods
designed for RGB data. Source code is available at:
https://github.com/DexiangHong/MANet.
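The linked repository contains the official implementation. Purely as a rough, hypothetical sketch of the ideas described above (not the authors' code), the PyTorch snippet below shows how I-frame, motion-vector, and residual features might be fused with channel and spatial attention, and how a single learnable object query attending to fused visual and language tokens can emit a content-aware dynamic kernel that is convolved with the features to produce the mask. All module names, layer sizes, and the exact attention design are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualPathFusion(nn.Module):
    """Fuse motion-vector or residual features into the I-frame feature with
    channel and spatial attention (a stand-in for the paper's dual-path
    dual-attention module; the exact design is an assumption)."""

    def __init__(self, dim):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(2 * dim, dim, 1), nn.Sigmoid()
        )
        self.spatial_gate = nn.Sequential(nn.Conv2d(2 * dim, 1, 3, padding=1), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, i_feat, aux_feat):
        x = torch.cat([i_feat, aux_feat], dim=1)        # (B, 2C, H, W)
        fused = self.proj(x)
        fused = fused * self.channel_gate(x)            # channel attention
        fused = fused * self.spatial_gate(x)            # spatial attention
        return i_feat + fused                           # residual fusion


class SingleQueryDynamicKernelHead(nn.Module):
    """One learnable object query attends to fused visual + language tokens and
    emits a dynamic 1x1 conv kernel that is applied to the feature map to
    produce mask logits, so no mask matching is needed with a single query."""

    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.to_kernel = nn.Linear(dim, dim)             # dynamic 1x1 kernel weights

    def forward(self, vis_feat, text_tokens):
        B, C, H, W = vis_feat.shape
        vis_tokens = vis_feat.flatten(2).transpose(1, 2)          # (B, HW, C)
        memory = torch.cat([vis_tokens, text_tokens], dim=1)      # multi-modal memory
        q = self.decoder(self.query.expand(B, -1, -1), memory)    # (B, 1, C)
        kernel = self.to_kernel(q).reshape(B, 1, C, 1, 1)
        masks = torch.stack(
            [F.conv2d(vis_feat[b : b + 1], kernel[b]) for b in range(B)]
        ).squeeze(1)                                              # (B, 1, H, W)
        return masks


if __name__ == "__main__":
    B, C, H, W, L = 2, 64, 32, 32, 10
    i_feat = torch.randn(B, C, H, W)        # I-frame features
    mv_feat = torch.randn(B, C, H, W)       # motion-vector features
    res_feat = torch.randn(B, C, H, W)      # residual features
    text = torch.randn(B, L, C)             # encoded language expression

    fuse_mv, fuse_res = DualPathFusion(C), DualPathFusion(C)
    head = SingleQueryDynamicKernelHead(C)
    vis = fuse_res(fuse_mv(i_feat, mv_feat), res_feat)
    print(head(vis, text).shape)            # torch.Size([2, 1, 32, 32])
```

Because only one query and one kernel are learned, the head directly outputs a single mask per clip, which is why no post-hoc matching between candidate masks and the referred object is required.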
Related papers
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z)
- Video Object Segmentation with Dynamic Query Modulation [23.811776213359625]
We propose a query modulation method, termed QMVOS, for object and multi-object segmentation.
Our method can bring significant improvements to the memory-based SVOS method and achieve competitive performance on standard SVOS benchmarks.
arXiv Detail & Related papers (2024-03-18T07:31:39Z)
- GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval [59.47258928867802]
Given a text query, partially relevant video retrieval (PRVR) seeks to find videos containing pertinent moments in a database.
This paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models clip representations implicitly.
Experiments on three large-scale video datasets demonstrate the superiority and efficiency of GMMFormer.
arXiv Detail & Related papers (2023-10-08T15:04:50Z)
- Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation [24.814534011440877]
We propose an end-to-end RVOS framework which treats the RVOS task as a mask sequence learning problem.
To capture the object-level spatial context, we have developed the Stacked Transformer.
The model finds the best matching between mask sequence and text query.
arXiv Detail & Related papers (2023-09-21T09:47:47Z)
- Spectrum-guided Multi-granularity Referring Video Object Segmentation [56.95836951559529]
Current referring video object segmentation (R-VOS) techniques extract conditional kernels from encoded (low-resolution) vision-language features to segment the decoded high-resolution features.
This causes significant feature drift, which the segmentation kernels struggle to perceive during the forward computation.
We propose a Spectrum-guided Multi-granularity approach, which performs direct segmentation on the encoded features and employs visual details to further optimize the masks.
arXiv Detail & Related papers (2023-07-25T14:35:25Z)
- Bimodal SegNet: Instance Segmentation Fusing Events and RGB Frames for Robotic Grasping [4.191965713559235]
We propose a Deep Learning network that fuses two types of visual signals, event-based data and RGB frame data.
The Bimodal SegNet network has two distinct encoders, one for each signal input, and a spatial pyramid pooling module with atrous convolutions.
The evaluation results show a 6-10% improvement over state-of-the-art methods in terms of mean intersection over union and pixel accuracy.
arXiv Detail & Related papers (2023-03-20T16:09:25Z)
- Unsupervised Video Object Segmentation via Prototype Memory Network [5.612292166628669]
Unsupervised video object segmentation aims to segment a target object in the video without a ground truth mask in the initial frame.
This challenge requires extracting features for the most salient common objects within a video sequence.
We propose a novel prototype memory network architecture to solve this problem.
arXiv Detail & Related papers (2022-09-08T11:08:58Z)
- The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment object instances in a given video referred by a language expression in all video frames.
This work proposes several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference.
The improved ReferFormer ranks 2nd place on CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
- Local-Global Context Aware Transformer for Language-Guided Video Segmentation [103.35509224722097]
We explore the task of language-guided video segmentation (LVS).
We present Locater, which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner.
To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon A2D-S dataset.
arXiv Detail & Related papers (2022-03-18T07:35:26Z)
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently (a rough illustrative sketch of this grounding step follows this entry).
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
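As a rough, hypothetical illustration (not the authors' implementation) of the second-stage tracklet-language grounding summarized in the entry above, the snippet below jointly encodes pooled per-tracklet features and word features with a Transformer encoder and scores each tracklet against the referring expression. All names, dimensions, and the use of a type embedding to separate tracklet tokens from word tokens are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TrackletLanguageGrounding(nn.Module):
    """Score candidate object tracklets against a referring expression.
    A Transformer encoder jointly processes tracklet and word tokens so that
    instance-level relations and cross-modal interactions are modeled in one pass."""

    def __init__(self, dim=256, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.type_embed = nn.Embedding(2, dim)   # 0 = tracklet token, 1 = word token
        self.score = nn.Linear(dim, 1)

    def forward(self, tracklet_feats, word_feats):
        # tracklet_feats: (B, N, C) pooled per-tracklet features
        # word_feats:     (B, L, C) encoded referring expression
        N = tracklet_feats.size(1)
        tokens = torch.cat(
            [tracklet_feats + self.type_embed.weight[0],
             word_feats + self.type_embed.weight[1]], dim=1)
        out = self.encoder(tokens)                       # joint cross-modal reasoning
        scores = self.score(out[:, :N]).squeeze(-1)      # (B, N): one score per tracklet
        return scores.softmax(dim=-1)


if __name__ == "__main__":
    grounder = TrackletLanguageGrounding()
    probs = grounder(torch.randn(2, 5, 256), torch.randn(2, 7, 256))
    print(probs.argmax(dim=-1))   # index of the tracklet best matching the expression
```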