A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects
- URL: http://arxiv.org/abs/2506.13552v1
- Date: Mon, 16 Jun 2025 14:39:03 GMT
- Title: A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects
- Authors: Guohuan Xie, Syed Ariff Syed Hesham, Wenya Guo, Bing Li, Ming-Ming Cheng, Guolei Sun, Yun Liu
- Abstract summary: Video Scene Parsing (VSP) has emerged as a cornerstone in computer vision, facilitating the simultaneous segmentation, recognition, and tracking of diverse visual entities in dynamic scenes.
- Score: 53.15503034595476
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video Scene Parsing (VSP) has emerged as a cornerstone in computer vision, facilitating the simultaneous segmentation, recognition, and tracking of diverse visual entities in dynamic scenes. In this survey, we present a holistic review of recent advances in VSP, covering a wide array of vision tasks, including Video Semantic Segmentation (VSS), Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), as well as Video Tracking and Segmentation (VTS), and Open-Vocabulary Video Segmentation (OVVS). We systematically analyze the evolution from traditional hand-crafted features to modern deep learning paradigms -- spanning from fully convolutional networks to the latest transformer-based architectures -- and assess their effectiveness in capturing both local and global temporal contexts. Furthermore, our review critically discusses the technical challenges, ranging from maintaining temporal consistency to handling complex scene dynamics, and offers a comprehensive comparative study of datasets and evaluation metrics that have shaped current benchmarking standards. By distilling the key contributions and shortcomings of state-of-the-art methodologies, this survey highlights emerging trends and prospective research directions that promise to further elevate the robustness and adaptability of VSP in real-world applications.
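As a concrete reference point for the evaluation metrics the survey compares, the sketch below computes per-frame mean IoU averaged over a clip, one common way video semantic segmentation benchmarks summarize accuracy. This is an illustrative sketch only: the function names are ours, and real benchmarks differ in details such as ignore labels and how absent classes are handled.

```python
import numpy as np

def frame_miou(pred, gt, num_classes):
    """Mean IoU between one predicted and one ground-truth label map."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

def video_miou(pred_frames, gt_frames, num_classes):
    """Average per-frame mIoU across a clip."""
    return float(np.mean([frame_miou(p, g, num_classes)
                          for p, g in zip(pred_frames, gt_frames)]))
```

Note that averaging per-frame scores captures segmentation quality but not temporal consistency; metrics such as VPQ for video panoptic segmentation additionally match instances across frames.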
Related papers
- From Waveforms to Pixels: A Survey on Audio-Visual Segmentation [43.79010208565961]
Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. We present a comprehensive overview of the AVS field, covering its problem formulation, benchmark datasets, evaluation metrics, and the progression of methodologies.
arXiv Detail & Related papers (2025-07-29T22:20:51Z) - SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction [65.15449703659772]
Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. We propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC achieves an 11.8-point improvement over SAM on SeCVOS, establishing a new state of the art in concept-aware video object segmentation.
arXiv Detail & Related papers (2025-07-21T17:59:02Z) - AceVFI: A Comprehensive Survey of Advances in Video Frame Interpolation [8.563354084119062]
Video Frame Interpolation (VFI) is a fundamental Low-Level Vision (LLV) task that synthesizes intermediate frames between existing ones. We introduce AceVFI, the most comprehensive survey on VFI to date, covering over 250 papers across these approaches. We categorize the learning paradigms of VFI methods, namely Center-Time Frame Interpolation (CTFI) and Arbitrary-Time Frame Interpolation (ATFI).
arXiv Detail & Related papers (2025-06-01T16:01:24Z) - Towards Open-Vocabulary Video Semantic Segmentation [40.58291642595943]
We introduce the Open-Vocabulary Video Semantic Segmentation (OV-VSS) task, designed to accurately segment every pixel across a wide range of open-vocabulary categories. To enhance OV-VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module. Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the video context.
arXiv Detail & Related papers (2024-12-12T14:53:16Z) - TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation [97.96178992465511]
We argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses.
To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics.
arXiv Detail & Related papers (2024-06-12T21:41:32Z) - DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z) - Deep Learning Techniques for Video Instance Segmentation: A Survey [19.32547752428875]
Video instance segmentation is an emerging computer vision research area introduced in 2019.
Deep-learning techniques take a dominant role in various computer vision areas.
This survey offers a multifaceted view of deep-learning schemes for video instance segmentation.
arXiv Detail & Related papers (2023-10-19T00:27:30Z) - A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic Segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z) - A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z) - Video Summarization Using Deep Neural Networks: A Survey [72.98424352264904]
Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content.
This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization.
arXiv Detail & Related papers (2021-01-15T11:41:29Z) - Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.