EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations
- URL: http://arxiv.org/abs/2209.13064v1
- Date: Mon, 26 Sep 2022 23:03:26 GMT
- Title: EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations
- Authors: Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard
Higgins, Sanja Fidler, David Fouhey, Dima Damen
- Abstract summary: We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video.
Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions.
VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality.
- Score: 83.26326325568208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce VISOR, a new dataset of pixel annotations and a benchmark suite
for segmenting hands and active objects in egocentric video. VISOR annotates
videos from EPIC-KITCHENS, which comes with a new set of challenges not
encountered in current video segmentation datasets. Specifically, we need to
ensure both short- and long-term consistency of pixel-level annotations as
objects undergo transformative interactions, e.g. an onion is peeled, diced and
cooked - where we aim to obtain accurate pixel-level annotations of the peel,
onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR
introduces an annotation pipeline, AI-powered in parts, for scalability and
quality. In total, we publicly release 272K manual semantic masks of 257 object
classes, 9.9M interpolated dense masks, 67K hand-object relations, covering 36
hours of 179 untrimmed videos. Along with the annotations, we introduce three
challenges in video object segmentation, interaction understanding and
long-term reasoning.
For data, code and leaderboards: http://epic-kitchens.github.io/VISOR
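As a rough illustration of how VISOR-style per-frame mask annotations might be consumed, the sketch below rasterises polygon segments into per-class binary masks with standard tooling. The JSON layout and field names (image_name, annotations, class, segments) are assumptions made for this example, not the documented schema; the dataset page above describes the actual release format.
```python
import json
import numpy as np
from PIL import Image, ImageDraw

# Hypothetical loader for a VISOR-style sparse-annotation JSON file.
# Field names below are illustrative assumptions; see
# http://epic-kitchens.github.io/VISOR for the real schema.
def load_frame_masks(json_path, height=1080, width=1920):
    with open(json_path) as f:
        frames = json.load(f)

    masks = {}  # image name -> {class name -> binary mask}
    for frame in frames:
        per_class = {}
        for obj in frame["annotations"]:
            canvas = Image.new("L", (width, height), 0)
            draw = ImageDraw.Draw(canvas)
            # An object may be split into several polygon segments,
            # e.g. onion pieces after chopping.
            for polygon in obj["segments"]:
                draw.polygon([tuple(p) for p in polygon], fill=1)
            per_class[obj["class"]] = np.array(canvas, dtype=bool)
        masks[frame["image_name"]] = per_class
    return masks

if __name__ == "__main__":
    masks = load_frame_masks("P01_101_annotations.json")  # hypothetical file name
    for name, objs in list(masks.items())[:3]:
        print(name, {cls: int(m.sum()) for cls, m in objs.items()})
```
A real loader would additionally expose the hand-object relation labels and the interpolated dense masks, which this sketch omits for brevity.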
Related papers
- VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS).
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z) - PM-VIS+: High-Performance Video Instance Segmentation without Video Annotation [15.9587266448337]
Video instance segmentation requires detecting, segmenting, and tracking objects in videos.
This paper introduces a method that eliminates video annotations by utilizing image datasets.
arXiv Detail & Related papers (2024-06-28T05:22:39Z) - Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse point and box tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z) - Point-VOS: Pointing Up Video Object Segmentation [16.359861197595986]
Current state-of-the-art Video Object Segmentation (VOS) methods rely on dense per-object mask annotations both during training and testing.
We propose a novel Point-VOS task with a spatio-temporally sparse point-wise annotation scheme that substantially reduces the annotation effort.
We show that our data can be used to improve models that connect vision and language, by evaluating it on the Video Narrative Grounding (VNG) task.
arXiv Detail & Related papers (2024-02-08T18:52:23Z) - Sketch-based Video Object Segmentation: Benchmark and Analysis [55.79497833614397]
This paper introduces a new task of sketch-based video object segmentation, an associated benchmark, and a strong baseline.
Our benchmark includes three datasets, Sketch-DAVIS16, Sketch-DAVIS17 and Sketch-YouTube-VOS, which exploit human-drawn sketches as an informative yet low-cost reference for video object segmentation.
Experimental results show that sketches are more effective yet more annotation-efficient than other references, such as photo masks, language and scribbles.
arXiv Detail & Related papers (2023-11-13T11:53:49Z) - Learning the What and How of Annotation in Video Object Segmentation [11.012995995497029]
Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation.
The traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects in each video frame.
We propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation.
arXiv Detail & Related papers (2023-11-08T00:56:31Z) - EPIC Fields: Marrying 3D Geometry and Video Understanding [76.60638761589065]
EPIC Fields is an augmentation of EPIC-KITCHENS with 3D camera information.
It removes the complex and expensive step of reconstructing cameras using photogrammetry.
It reconstructs 96% of videos in EPIC-KITCHENS, registering 19M frames in 99 hours recorded in 45 kitchens.
arXiv Detail & Related papers (2023-06-14T20:33:49Z) - Breaking the "Object" in Video Object Segmentation [36.20167854011788]
We present a dataset for Video Object Segmentation under Transformations (VOST).
It consists of more than 700 high-resolution videos, captured in diverse environments, which are 21 seconds long on average and densely labeled with instance masks.
A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent.
We show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues.
arXiv Detail & Related papers (2022-12-12T19:22:17Z) - VideoClick: Video Object Segmentation with a Single Click [93.7733828038616]
We propose a bottom-up approach where, given a single click for each object in a video, we obtain the segmentation masks of these objects in the full video.
In particular, we construct a correlation volume that assigns each pixel in a target frame to either one of the objects in the reference frame or the background (a minimal sketch of this step follows this entry).
Results on the new CityscapesVideo dataset show that our approach outperforms all the baselines in this challenging setting.
arXiv Detail & Related papers (2021-01-16T23:07:48Z)
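To make the correlation-volume idea concrete, the sketch below scores every target-frame pixel against every reference-frame pixel plus a background embedding and assigns it to the object whose reference mask contains its best match. The dot-product similarity, the single background vector, and the hard argmax are assumptions made for illustration; they are not VideoClick's exact architecture, and the computation is only practical on a downsampled feature grid.
```python
import torch
import torch.nn.functional as F

def assign_by_correlation(ref_feats, tgt_feats, ref_obj_masks, bg_embed):
    """Illustrative correlation-volume assignment.

    ref_feats:     (C, H, W) features of the reference frame
    tgt_feats:     (C, H, W) features of the target frame
    ref_obj_masks: (K, H, W) binary masks of the K reference objects
    bg_embed:      (C,) background embedding (assumed learned elsewhere)
    Returns an (H, W) label map: 0 = background, 1..K = object ids.
    """
    C, H, W = ref_feats.shape
    ref = F.normalize(ref_feats.reshape(C, -1), dim=0)   # (C, HW)
    tgt = F.normalize(tgt_feats.reshape(C, -1), dim=0)   # (C, HW)
    bg = F.normalize(bg_embed, dim=0)                     # (C,)

    # Correlation volume: similarity of each target pixel to every
    # reference pixel, plus one extra column for "background".
    corr = tgt.t() @ ref                                  # (HW, HW)
    bg_score = tgt.t() @ bg                               # (HW,)
    corr = torch.cat([bg_score.unsqueeze(1), corr], dim=1)

    best = corr.argmax(dim=1)                             # winning column per pixel

    # Map each winning reference pixel to its object id (0 = background).
    ref_labels = torch.zeros(H * W, dtype=torch.long)
    for k, mask in enumerate(ref_obj_masks.reshape(len(ref_obj_masks), -1), start=1):
        ref_labels[mask.bool()] = k
    ref_labels = torch.cat([torch.zeros(1, dtype=torch.long), ref_labels])

    return ref_labels[best].reshape(H, W)

if __name__ == "__main__":
    # Toy example on a 32x32 feature grid with two reference objects.
    ref_f, tgt_f = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
    masks = torch.zeros(2, 32, 32)
    masks[0, :10, :10] = 1
    masks[1, 20:, 20:] = 1
    labels = assign_by_correlation(ref_f, tgt_f, masks, torch.randn(64))
    print(labels.shape, labels.unique())
```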