Learning the What and How of Annotation in Video Object Segmentation
- URL: http://arxiv.org/abs/2311.04414v2
- Date: Sat, 11 Nov 2023 19:15:57 GMT
- Title: Learning the What and How of Annotation in Video Object Segmentation
- Authors: Thanos Delatolas, Vicky Kalogeiton, Dim P. Papadopoulos
- Abstract summary: Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation.
The traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects at each video frame.
We propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation.
- Score: 11.012995995497029
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Object Segmentation (VOS) is crucial for several applications, from
video editing to video data generation. Training a VOS model requires an
abundance of manually labeled training videos. The de facto way of annotating
objects requires humans to draw detailed segmentation masks on the target
objects in each video frame. This annotation process, however, is
tedious and time-consuming. To reduce this annotation cost, in this paper, we
propose EVA-VOS, a human-in-the-loop annotation framework for video object
segmentation. Unlike the traditional approach, we introduce an agent that
predicts iteratively both which frame ("What") to annotate and which annotation
type ("How") to use. Then, the annotator annotates only the selected frame that
is used to update a VOS module, leading to significant gains in annotation
time. We conduct experiments on the MOSE and the DAVIS datasets and we show
that: (a) EVA-VOS leads to masks with accuracy close to the human agreement
3.5x faster than the standard way of annotating videos; (b) our frame selection
achieves state-of-the-art performance; (c) EVA-VOS yields significant
performance gains in terms of annotation time compared to all other methods and
baselines.
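
As a rough illustration of the loop the abstract describes, here is a minimal Python sketch of the human-in-the-loop annotation cycle: an agent repeatedly picks a frame ("What") and an annotation type ("How"), a human annotates only that frame, and a VOS module is updated. All names (`frame_selector`, `type_policy`, `vos_module`, `annotator`) are hypothetical stand-ins; the paper's actual agent architecture and interfaces are not specified here.

```python
# Hypothetical sketch of the EVA-VOS human-in-the-loop annotation cycle,
# as described in the abstract. Every class/method name is an illustrative
# assumption, not the paper's actual API.

def annotate_video(frames, frame_selector, type_policy, vos_module,
                   annotator, time_budget):
    """Iteratively decide "What" (which frame) and "How" (which annotation
    type), collect the human annotation, and update the VOS module."""
    masks = vos_module.propagate(frames)  # current mask predictions
    spent = 0.0
    while spent < time_budget:
        # "What": pick the frame whose annotation is expected to help most.
        t = frame_selector.select(frames, masks)
        # "How": pick the annotation type (e.g., full mask vs. quick correction).
        ann_type = type_policy.select(frames[t], masks[t])
        # The human annotates only the selected frame with the selected type.
        annotation, cost = annotator.annotate(frames[t], ann_type)
        spent += cost
        # Update the VOS module with the new label and re-propagate masks.
        vos_module.update(t, annotation)
        masks = vos_module.propagate(frames)
    return masks
```

In this sketch, annotation time is the budgeted resource, which is consistent with the abstract's claim of reaching near-human-agreement masks 3.5x faster than annotating dense masks frame by frame.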
Related papers
- PM-VIS+: High-Performance Video Instance Segmentation without Video Annotation [15.9587266448337]
Video instance segmentation requires detecting, segmenting, and tracking objects in videos.
This paper introduces a method that eliminates video annotations by utilizing image datasets.
arXiv Detail & Related papers (2024-06-28T05:22:39Z) - One-shot Training for Video Object Segmentation [11.52321103793505]
Video Object Segmentation (VOS) aims to track objects across frames in a video and segment them based on the initial annotated frame of the target objects.
Previous VOS works typically rely on fully annotated videos for training.
We propose a general one-shot training framework for VOS, requiring only a single labeled frame per training video.
arXiv Detail & Related papers (2024-05-22T21:37:08Z) - Point-VOS: Pointing Up Video Object Segmentation [16.359861197595986]
Current state-of-the-art Video Object Segmentation (VOS) methods rely on dense per-object mask annotations both during training and testing.
We propose a novel Point-VOS task with a spatio-temporally sparse point-wise annotation scheme that substantially reduces the annotation effort.
We show that our data can be used to improve models that connect vision and language, by evaluating it on the Video Narrative Grounding (VNG) task.
arXiv Detail & Related papers (2024-02-08T18:52:23Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of video representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z) - Learning Referring Video Object Segmentation from Weak Annotation [78.45828085350936]
Referring video object segmentation (RVOS) is a task that aims to segment the target object in all video frames based on a sentence describing the object.
We propose a new annotation scheme that reduces the annotation effort by 8 times, while providing sufficient supervision for RVOS.
Our scheme only requires a mask for the frame where the object first appears and bounding boxes for the rest of the frames.
arXiv Detail & Related papers (2023-08-04T06:50:52Z) - HODOR: High-level Object Descriptors for Object Re-segmentation in Video Learned from Static Images [123.65233334380251]
We propose HODOR: a novel method that effectively leverages annotated static images for understanding object appearance and scene context.
As a result, HODOR achieves state-of-the-art performance on the DAVIS and YouTube-VOS benchmarks.
Without any architectural modification, HODOR can also learn from video context around single annotated video frames.
arXiv Detail & Related papers (2021-12-16T18:59:53Z) - Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance, leading to state-of-the-art results in both the VOS and the more challenging tracking domains.
arXiv Detail & Related papers (2021-01-06T18:56:24Z) - Learning Video Object Segmentation from Unlabeled Videos [158.18207922363783]
We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos.
We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures properties of VOS at multiple granularities.
arXiv Detail & Related papers (2020-03-10T22:12:15Z)