Efficient video annotation with visual interpolation and frame selection guidance
- URL: http://arxiv.org/abs/2012.12554v1
- Date: Wed, 23 Dec 2020 09:31:40 GMT
- Title: Efficient video annotation with visual interpolation and frame selection guidance
- Authors: A. Kuznetsova, A. Talati, Y. Luo, K. Simmons and V. Ferrari
- Abstract summary: We introduce a unified framework for generic video annotation with bounding boxes.
We show that our approach reduces actual measured annotation time by 50% compared to commonly used linear methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a unified framework for generic video annotation with bounding
boxes. Video annotation is a longstanding problem, as it is a tedious and
time-consuming process. We tackle two important challenges of video annotation:
(1) automatic temporal interpolation and extrapolation of bounding boxes
provided by a human annotator on a subset of all frames, and (2) automatic
selection of frames to annotate manually. Our contribution is two-fold: first,
we propose a model that has both interpolating and extrapolating capabilities;
second, we propose a guiding mechanism that sequentially generates suggestions
for what frame to annotate next, based on the annotations made previously. We
extensively evaluate our approach on several challenging datasets in simulation
and demonstrate a reduction in terms of the number of manual bounding boxes
drawn by 60% over linear interpolation and by 35% over an off-the-shelf
tracker. Moreover, we also show 10% annotation time improvement over a
state-of-the-art method for video annotation with bounding boxes [25]. Finally,
we run human annotation experiments and provide extensive analysis of the
results, showing that our approach reduces actual measured annotation time by
50% compared to commonly used linear interpolation.
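As an illustrative sketch only (the helper names are ours, not the paper's), the linear-interpolation baseline that the method is compared against fills in boxes between two manually annotated keyframes coordinate-wise:

```python
# Sketch of the linear-interpolation baseline (not the paper's model):
# boxes on two annotated keyframes are interpolated coordinate-wise
# for all frames in between.

def lerp_box(box_a, box_b, t):
    """Linearly interpolate two boxes (x1, y1, x2, y2) at fraction t in [0, 1]."""
    return tuple(a + t * (b - a) for a, b in zip(box_a, box_b))

def interpolate_track(frame_a, box_a, frame_b, box_b):
    """Fill in a box for every frame between two annotated keyframes."""
    span = frame_b - frame_a
    return {
        f: lerp_box(box_a, box_b, (f - frame_a) / span)
        for f in range(frame_a, frame_b + 1)
    }

track = interpolate_track(0, (10, 10, 50, 50), 4, (30, 20, 70, 60))
# track[2] is the midpoint box (20.0, 15.0, 60.0, 55.0)
```

The paper's contribution replaces this purely geometric fill-in with a learned model that both interpolates and extrapolates, and additionally suggests which frame to annotate next.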
Related papers
- On-the-Fly Point Annotation for Fast Medical Video Labeling [1.890063512530524]
In medical research, deep learning models rely on high-quality annotated data.
Adjusting two box corners on every frame makes the process inherently frame-by-frame.
We propose an on-the-fly method for live video annotation to enhance the annotation efficiency.
arXiv Detail & Related papers (2024-04-22T16:59:43Z)
- SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations [12.139451002212063]
SSVOD exploits motion dynamics of videos to utilize large-scale unlabeled frames with sparse annotations.
Our method achieves significant performance improvements over existing methods on ImageNet-VID, Epic-KITCHENS, and YouTube-VIS.
arXiv Detail & Related papers (2023-09-04T06:41:33Z)
- Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
- Distill and Collect for Semi-Supervised Temporal Action Segmentation [0.0]
We propose an approach for the temporal action segmentation task that can simultaneously leverage knowledge from annotated and unannotated video sequences.
Our approach uses multi-stream distillation that repeatedly refines and finally combines their frame predictions.
Our model also predicts the action order, which is later used as a temporal constraint while estimating frame labels, countering the lack of supervision for unannotated videos.
arXiv Detail & Related papers (2022-11-02T17:34:04Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- A Video Summarization Method Using Temporal Interest Detection and Key Frame Prediction [3.9596068699962323]
Video summarization is formulated as a combination of sequence labeling and temporal interest detection problem.
In our method, we first build a flexible universal network framework that simultaneously predicts frame-level importance scores and temporal interest segments.
Tests and analysis on two benchmark datasets prove the effectiveness of our method.
arXiv Detail & Related papers (2021-09-26T12:15:18Z)
- Video Annotation for Visual Tracking via Selection and Refinement [74.08109740917122]
We present a new framework to facilitate bounding box annotations for video sequences.
A temporal assessment network is proposed to capture the temporal coherence of target locations.
A visual-geometry refinement network is also designed to further enhance the selected tracking results.
arXiv Detail & Related papers (2021-08-09T05:56:47Z)
- Weakly Supervised Video Salient Object Detection [79.51227350937721]
We present the first weakly supervised video salient object detection model based on relabeled "fixation guided scribble annotations".
An "Appearance-motion fusion module" and bidirectional ConvLSTM based framework are proposed to achieve effective multi-modal learning and long-term temporal context modeling.
arXiv Detail & Related papers (2021-04-06T09:48:38Z)
- Motion-blurred Video Interpolation and Extrapolation [72.3254384191509]
We present a novel framework for deblurring, interpolating and extrapolating sharp frames from a motion-blurred video in an end-to-end manner.
To ensure temporal coherence across predicted frames and address potential temporal ambiguity, we propose a simple, yet effective flow-based rule.
arXiv Detail & Related papers (2021-03-04T12:18:25Z)
- ScribbleBox: Interactive Annotation Framework for Video Object Segmentation [62.86341611684222]
We introduce ScribbleBox, a novel interactive framework for annotating object instances with masks in videos.
Box tracks are annotated efficiently by approximating the trajectory using a parametric curve.
We show that our ScribbleBox approach reaches 88.92% J&F on DAVIS 2017 with 9.14 clicks per box track, and 4 frames of annotation.
arXiv Detail & Related papers (2020-08-22T00:33:10Z)
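The trajectory approximation mentioned in the ScribbleBox entry can be sketched as follows. This is an assumed illustration, not ScribbleBox's actual implementation: a quadratic Bezier curve is one simple parametric curve that lets a few annotated control boxes define a whole track, with intermediate frames read off the curve.

```python
# Illustrative sketch (assumption, not the ScribbleBox method): approximate
# a box track with a quadratic Bezier curve evaluated per coordinate, so
# only a few control boxes need manual annotation.

def bezier_box(p0, p1, p2, t):
    """Evaluate a quadratic Bezier through three control boxes at t in [0, 1]."""
    return tuple(
        (1 - t) ** 2 * a + 2 * (1 - t) * t * b + t ** 2 * c
        for a, b, c in zip(p0, p1, p2)
    )

# Three annotated control boxes (x1, y1, x2, y2); frames in between are
# sampled from the curve instead of being drawn by hand.
start, mid, end = (0, 0, 20, 20), (40, 10, 60, 30), (80, 0, 100, 20)
midpoint_box = bezier_box(start, mid, end, 0.5)
```

In practice such a curve would be fitted to a tracker's output and the annotator would only correct frames where the fit deviates from the object.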
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.