MDQE: Mining Discriminative Query Embeddings to Segment Occluded
Instances on Challenging Videos
- URL: http://arxiv.org/abs/2303.14395v1
- Date: Sat, 25 Mar 2023 08:13:36 GMT
- Title: MDQE: Mining Discriminative Query Embeddings to Segment Occluded
Instances on Challenging Videos
- Authors: Minghan Li and Shuai Li and Wangmeng Xiang and Lei Zhang
- Abstract summary: We propose to mine discriminative query embeddings (MDQE) to segment occluded instances on challenging videos.
The proposed MDQE is the first VIS method with per-clip input that achieves state-of-the-art results on challenging videos and competitive performance on simple videos.
- Score: 18.041697331616948
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While impressive progress has been achieved, video instance segmentation
(VIS) methods with per-clip input often fail on challenging videos with
occluded objects and crowded scenes. This is mainly because the instance queries
in these methods cannot adequately encode the discriminative embeddings of
instances, making it difficult for the query-based segmenter to distinguish
those 'hard' instances. To address this issue, we propose to mine discriminative
query embeddings (MDQE) to segment occluded instances on challenging videos. First,
we initialize the positional embeddings and content features of object queries
by considering their spatial contextual information and the inter-frame object
motion. Second, we propose an inter-instance mask repulsion loss to distance
each instance from its nearby non-target instances. The proposed MDQE is the
first VIS method with per-clip input that achieves state-of-the-art results on
challenging videos and competitive performance on simple videos. Specifically,
MDQE with ResNet50 achieves 33.0% and 44.5% mask AP on OVIS and YouTube-VIS
2021, respectively. The code of MDQE is available at
https://github.com/MinghanLi/MDQE_CVPR2023.
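The inter-instance mask repulsion loss described in the abstract can be illustrated with a minimal sketch. The snippet below is an assumption-based reading of the idea, not the authors' released implementation: it combines a standard per-instance mask loss with a penalty on predicted foreground that overlaps the union of nearby non-target instance masks. The function name, the binary cross-entropy attraction term, the overlap-penalty form, and the `repel_weight` parameter are all hypothetical.

```python
import torch
import torch.nn.functional as F

def mask_repulsion_loss(pred_logits, gt_mask, nearby_masks, repel_weight=1.0):
    """Hypothetical sketch of an inter-instance mask repulsion loss.

    pred_logits:  (H, W) float logits predicted for one target instance.
    gt_mask:      (H, W) float binary mask of the target instance.
    nearby_masks: (K, H, W) float binary masks of K nearby non-target instances.
    """
    # Attraction term: the usual mask loss against the instance's own ground truth.
    attraction = F.binary_cross_entropy_with_logits(pred_logits, gt_mask)

    # Union of all nearby non-target instance masks, shape (H, W).
    others = nearby_masks.amax(dim=0)

    # Repulsion term: average predicted foreground probability falling inside
    # nearby non-target instances; minimizing it pushes the predicted mask
    # away from its neighbors.
    pred_prob = pred_logits.sigmoid()
    repulsion = (pred_prob * others).sum() / others.sum().clamp(min=1.0)

    return attraction + repel_weight * repulsion

# Toy usage: two overlapping 8x8 instances.
if __name__ == "__main__":
    pred = torch.randn(8, 8)
    gt = torch.zeros(8, 8); gt[:, :5] = 1.0
    others = torch.zeros(1, 8, 8); others[0, :, 4:] = 1.0
    print(mask_repulsion_loss(pred, gt, others))
```

In this reading, the attraction term pulls the prediction toward its own instance while the repulsion term penalizes any overlap with nearby non-target instances, which matches the abstract's stated goal of distancing each instance from its neighbors.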
Related papers
- PM-VIS+: High-Performance Video Instance Segmentation without Video Annotation [15.9587266448337]
Video instance segmentation requires detecting, segmenting, and tracking objects in videos.
This paper introduces a method that eliminates video annotations by utilizing image datasets.
arXiv Detail & Related papers (2024-06-28T05:22:39Z)
- OW-VISCap: Open-World Video Instance Segmentation and Captioning [95.6696714640357]
We propose an approach to jointly segment, track, and caption previously seen or unseen objects in a video.
We generate rich descriptive and object-centric captions for each detected object via a masked attention augmented LLM input.
Our approach matches or surpasses state-of-the-art on three tasks.
arXiv Detail & Related papers (2024-04-04T17:59:58Z)
- What is Point Supervision Worth in Video Instance Segmentation? [119.71921319637748]
Video instance segmentation (VIS) is a challenging vision task that aims to detect, segment, and track objects in videos.
We reduce the human annotations to only one point for each object in a video frame during training, and obtain high-quality mask predictions close to fully supervised models.
Comprehensive experiments on three VIS benchmarks demonstrate competitive performance of the proposed framework, nearly matching fully supervised methods.
arXiv Detail & Related papers (2024-04-01T17:38:25Z)
- Robust Online Video Instance Segmentation with Track Queries [15.834703258232002]
We propose a fully online transformer-based video instance segmentation model that performs comparably to top offline methods on the YouTube-VIS 2019 benchmark.
We show that, when combined with a strong enough image segmentation architecture, track queries can exhibit impressive accuracy while not being constrained to short videos.
arXiv Detail & Related papers (2022-11-16T18:50:14Z)
- Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred to by a given language expression.
Existing works typically require the compressed video bitstream to be decoded to RGB frames before segmentation.
This may hamper their application in real-world, resource-limited scenarios such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z)
- Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach that achieves instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the lowest run-time among contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
- QueryInst: Parallelly Supervised Mask Query for Instance Segmentation [53.5613957875507]
We present QueryInst, a query based instance segmentation method driven by parallel supervision on dynamic mask heads.
We conduct extensive experiments on three challenging benchmarks, i.e., COCO, CityScapes, and YouTube-VIS.
QueryInst achieves the best performance among all online VIS approaches and strikes a decent speed-accuracy trade-off.
arXiv Detail & Related papers (2021-05-05T08:38:25Z)
- Occluded Video Instance Segmentation [133.80567761430584]
We collect a large-scale dataset called OVIS for occluded video instance segmentation.
OVIS consists of 296k high-quality instance masks from 25 semantic categories.
The highest AP achieved by state-of-the-art algorithms is only 14.4, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario.
arXiv Detail & Related papers (2021-02-02T15:35:43Z)