Two-Stream Networks for Object Segmentation in Videos
- URL: http://arxiv.org/abs/2208.04026v1
- Date: Mon, 8 Aug 2022 10:22:42 GMT
- Title: Two-Stream Networks for Object Segmentation in Videos
- Authors: Hannan Lu, Zhi Tian, Lirong Yang, Haibing Ren, Wangmeng Zuo
- Abstract summary: We present a Two-Stream Network (TSN) to segment the seen pixels based on their pixel-level memory retrieval.
A holistic understanding of the instance is obtained with dynamic segmentation heads conditioned on the features of the target instance.
The compact instance stream effectively improves the segmentation accuracy of the unseen pixels, while fusing two streams with the adaptive routing map leads to an overall performance boost.
- Score: 83.1383102535413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing matching-based approaches perform video object segmentation (VOS)
via retrieving support features from a pixel-level memory, while some pixels
may suffer from lack of correspondence in the memory (i.e., unseen), which
inevitably limits their segmentation performance. In this paper, we present a
Two-Stream Network (TSN). Our TSN includes (i) a pixel stream with a
conventional pixel-level memory, to segment the seen pixels based on their
pixel-level memory retrieval; (ii) an instance stream for the unseen pixels,
where a holistic understanding of the instance is obtained with dynamic
segmentation heads conditioned on the features of the target instance; and
(iii) a pixel division module generating a routing map, with which the output
embeddings of the two streams are fused. The compact instance stream
effectively improves the segmentation accuracy of the unseen pixels, while
fusing the two streams with the adaptive routing map leads to an overall
performance boost.
Through extensive experiments, we demonstrate the effectiveness of our proposed
TSN, and we also report state-of-the-art performance of 86.1% on YouTube-VOS
2018 and 87.5% on the DAVIS-2017 validation split.
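To make the fusion step concrete, here is a minimal PyTorch-style sketch of gating two per-pixel embeddings with a learned routing map, as the abstract describes. The module name, layer choices, and shapes are illustrative assumptions, not the authors' code.
```python
import torch
import torch.nn as nn

class RoutingFusion(nn.Module):
    """Toy sketch of the pixel-division / fusion idea: a routing map
    decides, per pixel, how much to trust each stream's embedding."""

    def __init__(self, embed_dim: int):
        super().__init__()
        # Predict a single-channel routing map from both embeddings.
        self.router = nn.Sequential(
            nn.Conv2d(2 * embed_dim, embed_dim, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, 1, 1),
            nn.Sigmoid(),  # 0 -> pixel stream, 1 -> instance stream
        )

    def forward(self, pixel_emb: torch.Tensor, inst_emb: torch.Tensor) -> torch.Tensor:
        route = self.router(torch.cat([pixel_emb, inst_emb], dim=1))  # (B, 1, H, W)
        return route * inst_emb + (1.0 - route) * pixel_emb

# Usage: fuse per-pixel embeddings from the two streams.
fusion = RoutingFusion(embed_dim=64)
pixel_emb = torch.randn(2, 64, 32, 32)   # from the pixel (memory) stream
inst_emb = torch.randn(2, 64, 32, 32)    # from the instance stream
fused = fusion(pixel_emb, inst_emb)      # (2, 64, 32, 32)
```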
Related papers
- FocSAM: Delving Deeply into Focused Objects in Segmenting Anything [58.042354516491024]
The Segment Anything Model (SAM) marks a notable milestone in segmentation models.
We propose FocSAM with a pipeline redesigned on two pivotal aspects.
First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object.
Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks.
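For intuition, a minimal sketch of a pixel-wise dynamic ReLU in the spirit of P-DyReLU, following the generic Dynamic ReLU recipe y = max_k(a_k * x + b_k) with coefficients predicted per pixel from a conditioning map (e.g., click features). All names and details here are assumptions, not FocSAM's implementation.
```python
import torch
import torch.nn as nn

class PixelwiseDynamicReLU(nn.Module):
    """Per-pixel slopes/intercepts are predicted from a conditioning map
    (same channel count as x), then y = max_k(a_k * x + b_k)."""

    def __init__(self, channels: int, k: int = 2):
        super().__init__()
        self.k = k
        # Predict 2*k coefficients (k slopes, k intercepts) per channel, per pixel.
        self.coef = nn.Conv2d(channels, 2 * k * channels, 1)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        coef = self.coef(cond).view(b, 2 * self.k, c, h, w)
        a, bias = coef[:, : self.k], coef[:, self.k :]
        a = 1.0 + a.tanh()               # slopes perturbed around the identity
        y = a * x.unsqueeze(1) + bias.tanh()
        return y.max(dim=1).values       # max over the k linear pieces

act = PixelwiseDynamicReLU(channels=32)
x = torch.randn(1, 32, 64, 64)
cond = torch.randn(1, 32, 64, 64)        # e.g., click-conditioned features
y = act(x, cond)                          # (1, 32, 64, 64)
```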
arXiv Detail & Related papers (2024-05-29T02:34:13Z)
- Look Before You Match: Instance Understanding Matters in Video Object Segmentation [114.57723592870097]
In this paper, we argue that instance understanding matters in video object segmentation (VOS).
We present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
We employ well-learned object queries from the IS branch to inject instance-specific information into the query key, with which instance-augmented matching is further performed.
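As a rough sketch of what "injecting" an object query into the matching key could look like, here is an attention-style memory readout with an additive key modulation. The injection operator and shapes are assumptions for illustration only.
```python
import torch
import torch.nn as nn

class InstanceAugmentedMatching(nn.Module):
    """Bias the query key with an object query before memory readout."""

    def __init__(self, dim: int):
        super().__init__()
        self.inject = nn.Linear(dim, dim)  # map object query -> key-space bias

    def forward(self, query_key, memory_key, memory_value, obj_query):
        # query_key: (B, HW, D)   memory_key/value: (B, T*HW, D)
        # obj_query: (B, D)       one learned query per target object
        qk = query_key + self.inject(obj_query).unsqueeze(1)  # instance-augmented key
        attn = torch.softmax(
            qk @ memory_key.transpose(1, 2) / qk.shape[-1] ** 0.5, dim=-1)
        return attn @ memory_value                             # (B, HW, D) readout

m = InstanceAugmentedMatching(dim=64)
out = m(torch.randn(1, 1024, 64), torch.randn(1, 4096, 64),
        torch.randn(1, 4096, 64), torch.randn(1, 64))
```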
arXiv Detail & Related papers (2022-12-13T18:59:59Z)
- Region Aware Video Object Segmentation with Deep Motion Modeling [56.95836951559529]
Region Aware Video Object Segmentation (RAVOS) is a method that predicts regions of interest (ROIs) for efficient object segmentation and memory storage.
For efficient segmentation, object features are extracted according to the ROIs, and an object decoder is designed for object-level segmentation.
For efficient memory storage, we propose motion path memory to filter out redundant context by memorizing the features within the motion path of objects between two frames.
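A minimal sketch of the motion-path filtering idea: keep only the features inside a box covering an object's location across two frames. The union-box heuristic and (x1, y1, x2, y2) box format are assumptions, not RAVOS's exact formulation.
```python
import torch

def motion_path_mask(box_prev, box_curr, h, w):
    """Boolean mask over the union box of an object's two-frame motion path."""
    x1 = min(box_prev[0], box_curr[0]); y1 = min(box_prev[1], box_curr[1])
    x2 = max(box_prev[2], box_curr[2]); y2 = max(box_prev[3], box_curr[3])
    mask = torch.zeros(h, w, dtype=torch.bool)
    mask[y1:y2, x1:x2] = True
    return mask

feats = torch.randn(256, 60, 90)                  # (C, H, W) frame features
keep = motion_path_mask((10, 5, 30, 25), (18, 9, 40, 30), 60, 90)
memory = feats[:, keep]                            # store only in-path features
```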
arXiv Detail & Related papers (2022-07-21T01:44:40Z)
- Rethinking BiSeNet For Real-time Semantic Segmentation [6.622485130017622]
BiSeNet has proven to be a popular two-stream network for real-time segmentation.
We propose a novel structure named Short-Term Dense Concatenate network (STDC) by removing structure redundancy.
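A minimal sketch of a Short-Term Dense Concatenate block: a few convolutions with progressively halved widths whose outputs are concatenated, so fine and coarse features share one output. The 1/2, 1/4, 1/8, 1/8 channel schedule is the commonly cited one; treat kernel sizes and other details as assumptions.
```python
import torch
import torch.nn as nn

class STDCBlock(nn.Module):
    """Dense concatenation of progressively narrower conv stages."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        widths = [out_ch // 2, out_ch // 4, out_ch // 8, out_ch // 8]
        self.convs = nn.ModuleList()
        ch = in_ch
        for w in widths:
            self.convs.append(nn.Sequential(
                nn.Conv2d(ch, w, 3, padding=1, bias=False),
                nn.BatchNorm2d(w), nn.ReLU(inplace=True)))
            ch = w

    def forward(self, x):
        outs = []
        for conv in self.convs:
            x = conv(x)
            outs.append(x)
        return torch.cat(outs, dim=1)   # concatenate all stage outputs

block = STDCBlock(32, 64)
y = block(torch.randn(1, 32, 56, 56))   # -> (1, 64, 56, 56)
```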
arXiv Detail & Related papers (2021-04-27T13:49:47Z)
- CE-FPN: Enhancing Channel Information for Object Detection [12.954675966833372]
Feature pyramid network (FPN) has been an effective framework to extract multi-scale features in object detection.
We present a novel channel enhancement network (CE-FPN) with three simple yet effective modules to alleviate the information loss caused by channel reduction in FPN.
Our experiments show that CE-FPN achieves competitive performance compared to state-of-the-art FPN-based detectors on MS COCO benchmark.
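One way to preserve channel information during top-down fusion, in the spirit of what CE-FPN builds on, is sub-pixel (pixel-shuffle) upsampling: trade channels for resolution instead of discarding them with a 1x1 reduction. This is an illustrative sketch, not CE-FPN itself.
```python
import torch
import torch.nn as nn

class SubPixelUpsample(nn.Module):
    """Channel-preserving 2x upsampling via sub-pixel convolution."""

    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        self.expand = nn.Conv2d(in_ch, out_ch * scale * scale, 1)
        self.shuffle = nn.PixelShuffle(scale)  # (B, C*r^2, H, W) -> (B, C, rH, rW)

    def forward(self, x):
        return self.shuffle(self.expand(x))

up = SubPixelUpsample(in_ch=512, out_ch=256)
top = torch.randn(1, 512, 16, 16)
lateral = torch.randn(1, 256, 32, 32)
fused = up(top) + lateral   # FPN-style top-down fusion without channel loss
```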
arXiv Detail & Related papers (2021-03-19T05:51:53Z)
- FPS-Net: A Convolutional Fusion Network for Large-Scale LiDAR Point Cloud Segmentation [30.736361776703568]
Scene understanding based on LiDAR point cloud is an essential task for autonomous cars to drive safely.
Most existing methods simply stack different point attributes/modalities as image channels to increase information capacity.
We design FPS-Net, a convolutional fusion network that exploits the uniqueness and discrepancy among the projected image channels for optimal point cloud segmentation.
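A minimal sketch of the idea the summary contrasts with naive channel stacking: encode each projected LiDAR channel (e.g., depth, intensity, height) separately, then fuse, so dissimilar modalities are not forced through shared early filters. Layer widths and the fusion operator are assumptions, not FPS-Net's design.
```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Per-modality encoders followed by a learned 1x1 fusion."""

    def __init__(self, n_modalities: int = 3, width: int = 32):
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, width, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(n_modalities)])
        self.fuse = nn.Conv2d(n_modalities * width, width, 1)

    def forward(self, x):                     # x: (B, M, H, W), one channel per modality
        feats = [enc(x[:, i : i + 1]) for i, enc in enumerate(self.encoders)]
        return self.fuse(torch.cat(feats, dim=1))

net = ModalityFusion()
out = net(torch.randn(2, 3, 64, 512))         # range-image-style input
```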
arXiv Detail & Related papers (2021-03-01T04:08:28Z)
- AINet: Association Implantation for Superpixel Segmentation [82.21559299694555]
We propose a novel Association Implantation (AI) module to enable the network to explicitly capture the relations between a pixel and its surrounding grids.
Our method not only achieves state-of-the-art performance but also maintains satisfactory inference efficiency.
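For intuition, a sketch of relating each pixel to its 3x3 surrounding superpixel grid cells, the kind of pixel-to-grid relation the AI module makes explicit. Here we merely score each pixel against the nine neighboring grid embeddings; the real module implants grid features into pixel features, so treat this as a loose analogy.
```python
import torch
import torch.nn.functional as F

def pixel_grid_association(pixel_feat, grid_feat):
    """Soft assignment of each pixel to its 3x3 neighboring grid cells."""
    B, C, H, W = pixel_feat.shape
    gh, gw = grid_feat.shape[2:]
    # Gather the 9 neighboring grid embeddings for every grid cell.
    neighbors = F.unfold(grid_feat, kernel_size=3, padding=1)   # (B, C*9, gh*gw)
    neighbors = neighbors.view(B, C, 9, gh, gw)
    scores = []
    for k in range(9):
        nb = F.interpolate(neighbors[:, :, k], size=(H, W), mode="nearest")
        scores.append((pixel_feat * nb).sum(1))                 # dot-product affinity
    return torch.stack(scores, dim=1).softmax(dim=1)            # (B, 9, H, W)

assoc = pixel_grid_association(torch.randn(1, 16, 96, 96), torch.randn(1, 16, 12, 12))
```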
arXiv Detail & Related papers (2021-01-26T10:40:13Z)
- Learning Category- and Instance-Aware Pixel Embedding for Fast Panoptic Segmentation [47.26296379603003]
Panoptic segmentation (PS) is a complex scene understanding task.
In our method, PS results are simply derived by assigning each pixel to a detected instance or a stuff class.
Our method not only demonstrates fast inference speed but is also the first one-stage method to achieve performance comparable to two-stage methods.
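A minimal sketch of that assignment step: each pixel goes to the detected instance whose embedding it is closest to, unless no instance is close enough, in which case it keeps its stuff-class prediction. The distance metric and threshold are assumptions for illustration.
```python
import torch

def assign_pixels(pixel_emb, instance_centers, stuff_logits, thresh=0.5):
    """One-stage panoptic assignment from pixel embeddings."""
    C, H, W = pixel_emb.shape
    flat = pixel_emb.view(C, -1).t()                    # (HW, C)
    d = torch.cdist(flat, instance_centers)             # (HW, N) embedding distances
    inst_id = d.argmin(dim=1)
    inst_ok = d.min(dim=1).values < thresh              # close enough to an instance?
    stuff_id = stuff_logits.argmax(dim=0).view(-1)      # fallback: stuff class
    # Offset stuff labels past the N instance ids to keep label spaces disjoint.
    label = torch.where(inst_ok, inst_id, stuff_id + instance_centers.shape[0])
    return label.view(H, W)

seg = assign_pixels(torch.randn(8, 64, 64), torch.randn(5, 8), torch.randn(3, 64, 64))
```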
arXiv Detail & Related papers (2020-09-28T14:07:50Z)
- Feature Flow: In-network Feature Flow Estimation for Video Object Detection [56.80974623192569]
Optical flow is widely used in computer vision tasks to provide pixel-level motion information.
A common approach is to forward optical flow to a neural network and fine-tune this network on the task dataset.
We propose a novel network (IFF-Net) with an In-network Feature Flow estimation module for video object detection.
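To ground this, here is the basic operation a feature-flow field enables: warping a previous frame's feature map to the current frame by bilinear sampling. This is a plain warping utility, not IFF-Net's estimation module itself.
```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp (B, C, H, W) features by a (B, 2, H, W) flow field."""
    B, C, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()         # (H, W, 2) base coordinates
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)  # shift by flow -> (B, H, W, 2)
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    return F.grid_sample(feat, grid, align_corners=True)

warped = warp_features(torch.randn(1, 64, 32, 32), torch.zeros(1, 2, 32, 32))
```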
arXiv Detail & Related papers (2020-09-21T07:55:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.