Related papers: Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended?

URL: http://arxiv.org/abs/2408.10627v1
Date: Tue, 20 Aug 2024 08:08:32 GMT
Title: Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended?
Authors: Chen Liang, Qiang Guo, Xiaochao Qu, Luoqi Liu, Ting Liu
Abstract summary: Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames. Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets. We propose a training strategy, Masked Video Consistency (MVC), which enhances spatial and temporal feature aggregation.
Score: 22.191260650245443
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames. Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets. This leads to inconsistent segmentation results across frames. To address these issues, we propose a training strategy, Masked Video Consistency (MVC), which enhances spatial and temporal feature aggregation. MVC randomly masks image patches, compelling the network to predict the entire semantic segmentation and thus improving its integration of contextual information. Additionally, we introduce Object Masked Attention (OMA) to optimize the cross-attention mechanism by reducing the impact of irrelevant queries, thereby enhancing temporal modeling capabilities. Our approach, integrated into the latest decoupled universal video segmentation framework, achieves state-of-the-art performance across five datasets for three video segmentation tasks, demonstrating significant improvements over previous methods without increasing model parameters.
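The patch-masking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`mask_patches`, `mask_video`), the square-patch grid, the `ratio` parameter, and the fill value are all assumptions made for illustration, and frames are plain nested lists so that only the Python standard library is needed. In a training loop, the masked clip would be fed to the network while the loss is still computed against the full, unmasked segmentation labels.

```python
import random


def mask_patches(frame, patch=4, ratio=0.5, fill=0, rng=None):
    """Return a copy of `frame` (a list of rows) in which each
    patch x patch block is replaced by `fill` with probability `ratio`.
    All parameter names here are hypothetical, chosen for illustration."""
    if rng is None:
        rng = random.Random()
    height, width = len(frame), len(frame[0])
    out = [row[:] for row in frame]  # copy rows; the input stays untouched
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            if rng.random() < ratio:  # decide per patch, not per pixel
                for y in range(top, min(top + patch, height)):
                    for x in range(left, min(left + patch, width)):
                        out[y][x] = fill
    return out


def mask_video(frames, patch=4, ratio=0.5, seed=0):
    """Mask every frame of a clip independently, using one seeded generator."""
    rng = random.Random(seed)
    return [mask_patches(f, patch, ratio, rng=rng) for f in frames]


# A toy clip: 3 frames of 8x8 pixels, all set to 1.
video = [[[1] * 8 for _ in range(8)] for _ in range(3)]
masked = mask_video(video, patch=4, ratio=0.5, seed=0)
```

Masking whole patches rather than individual pixels removes local evidence, so a model trained this way has to rely on surrounding context (and, in video, on neighbouring frames) to recover the full segmentation.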

Related papers

ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts [64.93416171745693]
Reasoning Video Object Segmentation is a challenging task, which generates a mask sequence from an input video and an implicit, complex text query. Existing works probe into the problem by finetuning Multimodal Large Language Models (MLLM) for segmentation-based output, while still falling short in difficult cases on videos given temporally-sensitive queries. We propose ThinkVideo, a novel framework which leverages the zero-shot Chain-of-Thought (CoT) capability of MLLM to address these challenges.
arXiv Detail & Related papers (2025-05-24T07:01:31Z)
Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation [49.113131249753714]
We propose an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders. We employ a cascade of decoders across all feature levels to optimally exploit the derived features.
arXiv Detail & Related papers (2025-01-14T03:15:46Z)
Multi-Granularity Video Object Segmentation [36.06127939037613]
We propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset. We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation. In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset.
arXiv Detail & Related papers (2024-12-02T13:17:41Z)
ReferEverything: Towards Segmenting Everything We Can Speak of in Videos [42.88584315033116]
We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our key insight is to preserve the entirety of the generative model's architecture by shifting its objective from predicting noise to predicting mask latents. REM performs on par with the state-of-the-art on in-domain datasets, like Ref-DAVIS, while outperforming them by up to 12 IoU points out-of-domain.
arXiv Detail & Related papers (2024-10-30T17:59:26Z)
Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT). We jointly adopt sparse point and box tracking, filtering out unstable points and capturing object-wise information. Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention [29.62044843067169]
Video object segmentation is a fundamental research problem in computer vision. We propose a new method for self-supervised video object segmentation based on distillation learning of deformable attention.
arXiv Detail & Related papers (2024-01-25T04:39:48Z)
Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals. Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars. Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
Multi-entity Video Transformers for Fine-Grained Video Representation Learning [34.26732761916984]
We re-examine the design of transformer architectures for video representation learning. A key aspect of our approach is the improved sharing of scene information in the temporal pipeline. Our Multi-entity Video Transformer (MV-Former) processes the frames as groups of entities represented as tokens linked across time.
arXiv Detail & Related papers (2023-11-17T21:23:12Z)
Self-supervised Object-Centric Learning for Videos [39.02148880719576]
We propose the first fully unsupervised method for segmenting multiple objects in real-world sequences. Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames. Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.
arXiv Detail & Related papers (2023-10-10T18:03:41Z)
Learning to Associate Every Segment for Video Panoptic Segmentation [123.03617367709303]
We learn coarse segment-level matching and fine pixel-level matching together. We show that our per-frame computation model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2021-06-17T13:06:24Z)
Adaptive Intermediate Representations for Video Understanding [50.64187463941215]
We introduce a new way to leverage semantic segmentation as an intermediate representation for video understanding. We propose a general framework which learns the intermediate representations (optical flow and semantic segmentation) jointly with the final video understanding task. We obtain more powerful visual representations for videos which lead to performance gains over the state-of-the-art.
arXiv Detail & Related papers (2021-04-14T21:37:23Z)
Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos. We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training. Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos. We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks. The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)
Hierarchical Attention Network for Action Segmentation [45.19890687786009]
The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in video. We propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time. We evaluate our system on challenging public benchmark datasets, including the MERL Shopping, 50 Salads, and Georgia Tech Egocentric datasets.
arXiv Detail & Related papers (2020-05-07T02:39:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.