VideoClick: Video Object Segmentation with a Single Click
- URL: http://arxiv.org/abs/2101.06545v1
- Date: Sat, 16 Jan 2021 23:07:48 GMT
- Title: VideoClick: Video Object Segmentation with a Single Click
- Authors: Namdar Homayounfar, Justin Liang, Wei-Chiu Ma, Raquel Urtasun
- Abstract summary: We propose a bottom-up approach where, given a single click for each object in a video, we obtain the segmentation masks of these objects in the full video.
In particular, we construct a correlation volume that assigns each pixel in a target frame to either one of the objects in the reference frame or the background.
Results on this new CityscapesVideo dataset show that our approach outperforms all the baselines in this challenging setting.
- Score: 93.7733828038616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Annotating videos with object segmentation masks typically involves a two-stage
procedure of drawing polygons per object instance for all the frames and
then linking them through time. While simple, this is a very tedious, time-consuming
and expensive process, making the creation of accurate annotations at
scale only possible for well-funded labs. What if we were able to segment an
object in the full video with only a single click? This would enable video
segmentation at scale on a very low budget, opening the door to many
applications. Towards this goal, in this paper we propose a bottom-up approach
where given a single click for each object in a video, we obtain the
segmentation masks of these objects in the full video. In particular, we
construct a correlation volume that assigns each pixel in a target frame to
either one of the objects in the reference frame or the background. We then
refine this correlation volume via a recurrent attention module and decode the
final segmentation. To evaluate the performance, we label the popular and
challenging Cityscapes dataset with video object segmentations. Results on this
new CityscapesVideo dataset show that our approach outperforms all the
baselines in this challenging setting.
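As a rough illustration of the correlation-volume step described in the abstract, the sketch below computes, for every pixel of a target frame, a similarity score against one feature vector per clicked object taken from the reference frame, plus a constant background score, and normalizes them into soft assignments. This is a minimal sketch under simplifying assumptions, not the paper's implementation: it presumes per-frame feature maps from some backbone; the function name build_correlation_volume, the per-click feature readout, and the constant background logit are illustrative choices; and the recurrent attention refinement and decoding stages are omitted.

```python
import torch

def build_correlation_volume(ref_feats, target_feats, clicks, bg_score=0.0):
    """Assign each target-frame pixel to one of the clicked objects or the background.

    ref_feats:    (C, H, W) feature map of the reference frame (assumed given)
    target_feats: (C, H, W) feature map of the target frame (assumed given)
    clicks:       list of (y, x) pixel coordinates, one click per object
    bg_score:     constant logit for the background class (illustrative choice)

    Returns an (N + 1, H, W) volume of soft assignments, where channel 0 is
    the background and channel k is the k-th clicked object.
    """
    C, H, W = target_feats.shape

    # One feature vector per object, read off at the clicked reference pixel.
    obj_feats = torch.stack([ref_feats[:, y, x] for y, x in clicks])   # (N, C)

    # Dot-product similarity between every target pixel and every object.
    sims = torch.einsum("nc,chw->nhw", obj_feats, target_feats)        # (N, H, W)

    # Prepend a constant background logit and normalize over objects + background.
    bg = torch.full((1, H, W), float(bg_score))
    volume = torch.softmax(torch.cat([bg, sims], dim=0), dim=0)        # (N+1, H, W)
    return volume

# Example usage with random features and two hypothetical clicks.
ref = torch.randn(64, 32, 64)
tgt = torch.randn(64, 32, 64)
vol = build_correlation_volume(ref, tgt, clicks=[(5, 10), (20, 40)])
labels = vol.argmax(dim=0)  # per-pixel object id (0 = background)
```

In the full method, the argmax shown here would be replaced by the recurrent attention refinement and the final decoding described in the abstract.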
Related papers
- 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) in the test set of the Complex Video Object Segmentation Challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z) - ClickVOS: Click Video Object Segmentation [29.20434078000283]
The Video Object Segmentation (VOS) task aims to segment objects in videos.
To address the limitations of existing settings, we propose a new setting named Click Video Object Segmentation (ClickVOS).
ClickVOS segments objects of interest across the whole video according to a single click per object in the first frame.
arXiv Detail & Related papers (2024-03-10T08:37:37Z) - Scene Summarization: Clustering Scene Videos into Spatially Diverse Frames [24.614476456145255]
We propose summarization as a new video-based scene understanding task.
It aims to summarize a long video walkthrough of a scene into a small set of frames that are spatially diverse in the scene.
Our solution is a two-stage self-supervised pipeline named SceneSum.
arXiv Detail & Related papers (2023-11-28T22:18:26Z) - Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA)
It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z) - Video Object of Interest Segmentation [27.225312139360963]
We present a new computer vision task named video object of interest segmentation (VOIS)
Given a video and a target image of interest, our objective is to simultaneously segment and track all objects in the video that are relevant to the target image.
Since no existing dataset is perfectly suitable for this new task, we specifically construct a large-scale dataset called LiveVideos.
arXiv Detail & Related papers (2022-12-06T10:21:10Z) - The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment object instances in a given video referred by a language expression in all video frames.
This work proposes several tricks to boost further, including cyclical learning rates, semi-supervised approach, and test-time augmentation inference.
The improved ReferFormer ranks 2nd place on CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach that achieves instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the lowest run-time among contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z) - Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)