Efficient video annotation with visual interpolation and frame selection guidance
- URL: http://arxiv.org/abs/2012.12554v1
- Date: Wed, 23 Dec 2020 09:31:40 GMT
- Title: Efficient video annotation with visual interpolation and frame selection guidance
- Authors: A. Kuznetsova, A. Talati, Y. Luo, K. Simmons and V. Ferrari
- Abstract summary: We introduce a unified framework for generic video annotation with bounding boxes.
We show that our approach reduces actual measured annotation time by 50% compared to commonly used linear methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a unified framework for generic video annotation with bounding
boxes. Video annotation is a longstanding problem, as it is a tedious and
time-consuming process. We tackle two important challenges of video annotation:
(1) automatic temporal interpolation and extrapolation of bounding boxes
provided by a human annotator on a subset of all frames, and (2) automatic
selection of frames to annotate manually. Our contribution is two-fold: first,
we propose a model that has both interpolating and extrapolating capabilities;
second, we propose a guiding mechanism that sequentially generates suggestions
for what frame to annotate next, based on the annotations made previously. We
extensively evaluate our approach on several challenging datasets in simulation
and demonstrate a reduction in terms of the number of manual bounding boxes
drawn by 60% over linear interpolation and by 35% over an off-the-shelf
tracker. Moreover, we also show 10% annotation time improvement over a
state-of-the-art method for video annotation with bounding boxes [25]. Finally,
we run human annotation experiments and provide extensive analysis of the
results, showing that our approach reduces actual measured annotation time by
50% compared to commonly used linear interpolation.
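As an illustrative sketch only (the helper names are ours, not the paper's), the linear-interpolation baseline that the method is compared against fills in boxes between two manually annotated keyframes coordinate-wise:

```python
# Sketch of the linear-interpolation baseline (not the paper's model):
# boxes on two annotated keyframes are interpolated coordinate-wise
# for all frames in between.

def lerp_box(box_a, box_b, t):
    """Linearly interpolate two boxes (x1, y1, x2, y2) at fraction t in [0, 1]."""
    return tuple(a + t * (b - a) for a, b in zip(box_a, box_b))

def interpolate_track(frame_a, box_a, frame_b, box_b):
    """Fill in a box for every frame between two annotated keyframes."""
    span = frame_b - frame_a
    return {
        f: lerp_box(box_a, box_b, (f - frame_a) / span)
        for f in range(frame_a, frame_b + 1)
    }

track = interpolate_track(0, (10, 10, 50, 50), 4, (30, 20, 70, 60))
# track[2] is the midpoint box (20.0, 15.0, 60.0, 55.0)
```

The paper's contribution replaces this purely geometric fill-in with a learned model that both interpolates and extrapolates, and additionally suggests which frame to annotate next.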
Related papers
- On-the-Fly Point Annotation for Fast Medical Video Labeling [1.890063512530524]
In medical research, deep learning models rely on high-quality annotated data.
Adjusting two box corners on every frame makes the process inherently frame-by-frame.
We propose an on-the-fly method for live video annotation to enhance the annotation efficiency.
arXiv Detail & Related papers (2024-04-22T16:59:43Z)
- SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations [12.139451002212063]
SSVOD exploits motion dynamics of videos to utilize large-scale unlabeled frames with sparse annotations.
Our method achieves significant performance improvements over existing methods on ImageNet-VID, Epic-KITCHENS, and YouTube-VIS.
arXiv Detail & Related papers (2023-09-04T06:41:33Z)
- Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
- Distill and Collect for Semi-Supervised Temporal Action Segmentation [0.0]
We propose an approach for the temporal action segmentation task that can simultaneously leverage knowledge from annotated and unannotated video sequences.
Our approach uses multi-stream distillation that repeatedly refines and finally combines their frame predictions.
Our model also predicts the action order, which is later used as a temporal constraint while estimating frame labels, countering the lack of supervision for unannotated videos.
arXiv Detail & Related papers (2022-11-02T17:34:04Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- A Video Summarization Method Using Temporal Interest Detection and Key Frame Prediction [3.9596068699962323]
Video summarization is formulated as a combination of sequence labeling and temporal interest detection problem.
In our method, we first build a flexible universal network framework that simultaneously predicts frame-level importance scores and temporal interest segments.
Tests and analysis on two benchmark datasets prove the effectiveness of our method.
arXiv Detail & Related papers (2021-09-26T12:15:18Z)
- Video Annotation for Visual Tracking via Selection and Refinement [74.08109740917122]
We present a new framework to facilitate bounding box annotations for video sequences.
A temporal assessment network is proposed to capture the temporal coherence of target locations.
A visual-geometry refinement network is also designed to further enhance the selected tracking results.
arXiv Detail & Related papers (2021-08-09T05:56:47Z)
- Weakly Supervised Video Salient Object Detection [79.51227350937721]
We present the first weakly supervised video salient object detection model based on relabeled "fixation guided scribble annotations".
An "Appearance-motion fusion module" and bidirectional ConvLSTM based framework are proposed to achieve effective multi-modal learning and long-term temporal context modeling.
arXiv Detail & Related papers (2021-04-06T09:48:38Z)
- Motion-blurred Video Interpolation and Extrapolation [72.3254384191509]
We present a novel framework for deblurring, interpolating and extrapolating sharp frames from a motion-blurred video in an end-to-end manner.
To ensure temporal coherence across predicted frames and address potential temporal ambiguity, we propose a simple, yet effective flow-based rule.
arXiv Detail & Related papers (2021-03-04T12:18:25Z)
- ScribbleBox: Interactive Annotation Framework for Video Object Segmentation [62.86341611684222]
We introduce ScribbleBox, a novel interactive framework for annotating object instances with masks in videos.
Box tracks are annotated efficiently by approximating the trajectory using a parametric curve.
We show that our ScribbleBox approach reaches 88.92% J&F on DAVIS 2017 with 9.14 clicks per box track, and 4 frames of annotation.
arXiv Detail & Related papers (2020-08-22T00:33:10Z)
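The trajectory approximation mentioned in the ScribbleBox entry can be sketched as follows. This is an assumed illustration, not ScribbleBox's actual implementation: a quadratic Bezier curve is one simple parametric curve that lets a few annotated control boxes define a whole track, with intermediate frames read off the curve.

```python
# Illustrative sketch (assumption, not the ScribbleBox method): approximate
# a box track with a quadratic Bezier curve evaluated per coordinate, so
# only a few control boxes need manual annotation.

def bezier_box(p0, p1, p2, t):
    """Evaluate a quadratic Bezier through three control boxes at t in [0, 1]."""
    return tuple(
        (1 - t) ** 2 * a + 2 * (1 - t) * t * b + t ** 2 * c
        for a, b, c in zip(p0, p1, p2)
    )

# Three annotated control boxes (x1, y1, x2, y2); frames in between are
# sampled from the curve instead of being drawn by hand.
start, mid, end = (0, 0, 20, 20), (40, 10, 60, 30), (80, 0, 100, 20)
midpoint_box = bezier_box(start, mid, end, 0.5)
```

In practice such a curve would be fitted to a tracker's output and the annotator would only correct frames where the fit deviates from the object.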
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.