1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation
- URL: http://arxiv.org/abs/2406.05352v1
- Date: Sat, 8 Jun 2024 04:43:08 GMT
- Title: 1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation
- Authors: Qingfeng Liu, Mostafa El-Khamy, Kee-Bong Song
- Abstract summary: The third Pixel-level Video Understanding in the Wild (PVUW CVPR 2024) challenge aims to advance the state of the art in video understanding.
This paper details our research work that won 1st place in the PVUW'24 VPS challenge.
Our solution stands on the shoulders of a giant vision transformer model (DINOv2 ViT-g) and proven multi-stage Decoupled Video Instance Segmentation (DVIS) frameworks.
- Score: 11.331198234997714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The third Pixel-level Video Understanding in the Wild (PVUW CVPR 2024) challenge aims to advance the state of the art in video understanding through benchmarking Video Panoptic Segmentation (VPS) and Video Semantic Segmentation (VSS) on challenging videos and scenes introduced in the large-scale Video Panoptic Segmentation in the Wild (VIPSeg) test set and the large-scale Video Scene Parsing in the Wild (VSPW) test set, respectively. This paper details our research work that won 1st place in the PVUW'24 VPS challenge, establishing state-of-the-art results in all metrics, including the Video Panoptic Quality (VPQ) and Segmentation and Tracking Quality (STQ). With minor fine-tuning, our approach also achieved 3rd place in the PVUW'24 VSS challenge ranked by the mIoU (mean intersection over union) metric, and 1st place ranked by the VC16 (16-frame video consistency) metric. Our winning solution stands on the shoulders of a giant foundational vision transformer model (DINOv2 ViT-g) and proven multi-stage Decoupled Video Instance Segmentation (DVIS) frameworks for video understanding.
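The abstract ranks the VSS results by mIoU and the VC16 (16-frame video consistency) metric. For readers unfamiliar with these measures, below is a minimal NumPy sketch of both over integer label maps; it illustrates the metric definitions as commonly stated for VSPW-style evaluation, not the official PVUW scoring code, and details such as the class-skipping convention and the sliding-window rule are assumptions here.

```python
import numpy as np

def miou(pred, gt, num_classes):
    """Mean IoU between two (H, W) integer label maps.
    Classes absent from both maps are skipped, a common convention."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

def video_consistency(preds, gts, k=16):
    """VC_k over (T, H, W) label maps (k=16 gives VC16): among pixels whose
    ground-truth label is constant across a k-frame window, the fraction the
    prediction labels correctly in every frame, averaged over all windows."""
    scores = []
    for t in range(preds.shape[0] - k + 1):
        gt_win, pr_win = gts[t:t + k], preds[t:t + k]
        stable = np.all(gt_win == gt_win[0], axis=0)   # GT-consistent pixels
        if not stable.any():
            continue
        correct = np.all(pr_win == gt_win, axis=0)     # right label in all k frames
        scores.append((stable & correct).sum() / stable.sum())
    return float(np.mean(scores)) if scores else 0.0

# Tiny smoke test: a perfect prediction scores 1.0 on both metrics.
rng = np.random.default_rng(0)
gts = rng.integers(0, 5, size=(32, 64, 64))
print(miou(gts[0], gts[0], num_classes=5), video_consistency(gts, gts, k=16))
```

Note that predictions which flicker between frames lower VC16 even when per-frame mIoU stays high, which is what the "Best Long Video Consistency" ranking rewards.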
Related papers
- LAVIB: A Large-scale Video Interpolation Benchmark [58.194606275650095]
LAVIB comprises a large collection of high-resolution videos sourced from the web through an automated pipeline.
Metrics are computed for each video's motion magnitudes, luminance conditions, frame sharpness, and contrast.
In total, LAVIB includes 283K clips from 17K ultra-HD videos, covering 77.6 hours.
arXiv Detail & Related papers (2024-06-14T06:44:01Z)
- 1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation [81.50620771207329]
We investigate the effectiveness of static-dominant data and frame sampling on referring video object segmentation (RVOS).
Our solution achieves a J&F score of 0.5447 in the competition phase and ranks 1st in the MeViS track of the PVUW Challenge.
arXiv Detail & Related papers (2024-06-11T08:05:26Z)
- 3rd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation [19.071113992267826]
We introduce a comprehensive approach centered on the query-wise ensemble, supplemented by additional techniques.
Our proposed approach achieved a VPQ score of 57.01 on the VIPSeg test set, and ranked 3rd in the VPS track of the 3rd Pixel-level Video Understanding in the Wild Challenge.
arXiv Detail & Related papers (2024-06-06T12:22:56Z)
- 2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation [12.274092278786966]
Video Panoptic Segmentation (VPS) aims to simultaneously classify, track, and segment all objects in a video.
We propose a robust integrated video panoptic segmentation solution.
Our method achieves state-of-the-art performance with VPQ scores of 56.36 and 57.12 in the development and test phases, respectively.
arXiv Detail & Related papers (2024-06-01T17:03:16Z)
- VideoPrism: A Foundational Visual Encoder for Video Understanding [90.01845485201746]
VideoPrism is a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text.
We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
arXiv Detail & Related papers (2024-02-20T18:29:49Z)
- NTIRE 2023 Quality Assessment of Video Enhancement Challenge [97.809937484099]
This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge.
The challenge addresses a major problem in the field of video processing, namely video quality assessment (VQA) for enhanced videos.
The challenge has a total of 167 registered participants.
arXiv Detail & Related papers (2023-07-19T02:33:42Z)
- 1st Place Solution for PVUW Challenge 2023: Video Panoptic Segmentation [25.235404527487784]
Video panoptic segmentation is a challenging task that serves as the cornerstone of numerous downstream applications.
We believe that the decoupling strategy proposed by DVIS enables more effective utilization of temporal information for both "thing" and "stuff" objects.
Our method achieved VPQ scores of 51.4 and 53.7 in the development and test phases, respectively, and ranked 1st in the VPS track of the 2nd PVUW Challenge.
arXiv Detail & Related papers (2023-06-07T01:24:48Z)
- 3rd Place Solution for PVUW2023 VSS Track: A Large Model for Semantic Segmentation on VSPW [68.56017675820897]
In this paper, we introduce the 3rd place solution for the PVUW2023 VSS track.
We have explored various image-level visual backbones and segmentation heads to tackle the problem of video semantic segmentation.
arXiv Detail & Related papers (2023-06-04T07:50:38Z)
- Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of the panoptic segmentation task, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames (a generic instance-and-stuff fusion sketch follows this entry).
arXiv Detail & Related papers (2020-06-19T19:35:47Z)
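As context for the panoptic entries above, the sketch below shows the common greedy heuristic for fusing "thing" instance masks with "stuff" semantic predictions into a single panoptic id map per frame. It is a generic illustration with assumed inputs and a hypothetical function name, not the actual merging code of VPSNet or DVIS.

```python
import numpy as np

def fuse_panoptic(instance_masks, instance_scores, semantic_map, stuff_ids,
                  score_thresh=0.5, overlap_keep=0.5):
    """Greedy panoptic fusion (assumed inputs, illustrative only).
    instance_masks: (N, H, W) bool; instance_scores: (N,) float;
    semantic_map: (H, W) int class labels; stuff_ids: iterable of "stuff"
    class ids. Returns an (H, W) int32 map of segment ids (0 = void)."""
    panoptic = np.zeros(semantic_map.shape, dtype=np.int32)
    next_id = 1
    # 1) Paste confident "thing" instances, highest score first.
    for i in np.argsort(-np.asarray(instance_scores)):
        if instance_scores[i] < score_thresh:
            continue
        free = instance_masks[i] & (panoptic == 0)  # pixels not yet claimed
        if not free.any() or free.sum() < overlap_keep * instance_masks[i].sum():
            continue  # mostly hidden behind higher-scoring instances: drop
        panoptic[free] = next_id
        next_id += 1
    # 2) Fill remaining pixels with "stuff" regions from the semantic head.
    for c in stuff_ids:
        region = (semantic_map == c) & (panoptic == 0)
        if region.any():
            panoptic[region] = next_id
            next_id += 1
    return panoptic
```

In the video setting, the remaining work is keeping each instance id consistent across frames; that temporal association is what VPQ and STQ reward and what DVIS's decoupled design targets.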