Video Annotator: A framework for efficiently building video classifiers
using vision-language models and active learning
- URL: http://arxiv.org/abs/2402.06560v1
- Date: Fri, 9 Feb 2024 17:19:05 GMT
- Title: Video Annotator: A framework for efficiently building video classifiers
using vision-language models and active learning
- Authors: Amir Ziai, Aneesh Vartakavi
- Abstract summary: Video Annotator (VA) is a framework for annotating, managing, and iterating on video classification datasets.
VA allows for a continuous annotation process, seamlessly integrating data collection and model training.
VA achieves a median 6.8 point improvement in Average Precision relative to the most competitive baseline.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-quality and consistent annotations are fundamental to the successful
development of robust machine learning models. Traditional data annotation
methods are resource-intensive and inefficient, often leading to a reliance on
third-party annotators who are not domain experts. Hard samples, which are
usually the most informative for model training, tend to be difficult to label
accurately and consistently without business context. These can arise
unpredictably during the annotation process, requiring a variable number of
iterations and rounds of feedback, leading to unforeseen expenses and time
commitments to guarantee quality.
We posit that more direct involvement of domain experts, using a
human-in-the-loop system, can resolve many of these practical challenges. We
propose a novel framework we call Video Annotator (VA) for annotating,
managing, and iterating on video classification datasets. Our approach offers a
new paradigm for an end-user-centered model development process, enhancing the
efficiency, usability, and effectiveness of video classifiers. Uniquely, VA
allows for a continuous annotation process, seamlessly integrating data
collection and model training.
We leverage the zero-shot capabilities of vision-language foundation models
combined with active learning techniques, and demonstrate that VA enables the
efficient creation of high-quality models. VA achieves a median 6.8 point
improvement in Average Precision relative to the most competitive baseline
across a wide-ranging assortment of tasks. We release a dataset with 153k
labels across 56 video understanding tasks annotated by three professional
video editors using VA, and also release code to replicate our experiments at:
http://github.com/netflix/videoannotator.
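To make the zero-shot bootstrapping concrete, the sketch below shows one common way to score unlabeled clips against a plain-text description of the target concept using a CLIP-style vision-language model. This is an illustrative assumption, not the paper's released code: the frame sampling, the mean-pooling of frame embeddings, and the `zero_shot_clip_score` helper are choices made here for brevity.

```python
# Hedged sketch (not VA's released implementation): zero-shot relevance scoring
# of a video clip with OpenAI's CLIP. Sampled frames are embedded, L2-normalized,
# mean-pooled into a clip embedding, and compared to a text prompt.
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def zero_shot_clip_score(frame_paths, prompt):
    """Cosine similarity between a pooled clip embedding and a text prompt."""
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        f = model.encode_image(frames).float()
        f = f / f.norm(dim=-1, keepdim=True)
        clip_emb = f.mean(dim=0, keepdim=True)               # pool frames into one clip vector
        clip_emb = clip_emb / clip_emb.norm(dim=-1, keepdim=True)
        t = model.encode_text(clip.tokenize([prompt]).to(device)).float()
        t = t / t.norm(dim=-1, keepdim=True)
    return (clip_emb @ t.T).item()
```

Ranking unlabeled clips by such a score is a cheap way to surface likely positives (and, from the bottom of the ranking, likely negatives) for the first round of expert annotation.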
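The continuous annotation process described in the abstract can then be pictured as a loop that alternates expert labeling with retraining a lightweight classifier and querying the clips the model is least certain about. The sketch below is a generic uncertainty-sampling loop under that reading; `request_labels` stands in for the expert-facing annotation UI and, like the other names, is hypothetical rather than part of the released API.

```python
# Hedged sketch of a human-in-the-loop active learning cycle. `clip_embeddings`
# is an (N, D) array of precomputed clip-level features (e.g. pooled CLIP
# embeddings as above); `request_labels` is a hypothetical callable that shows
# clips to a domain expert and returns {clip_index: 0 or 1}.
import numpy as np
from sklearn.linear_model import LogisticRegression

def annotation_loop(clip_embeddings, request_labels, seed_indices,
                    n_rounds=5, batch_size=20):
    # Seed round, e.g. top- and bottom-ranked clips from the zero-shot scores;
    # assumes the expert's seed labels contain both classes.
    labeled = dict(request_labels(seed_indices))
    pool = set(range(len(clip_embeddings))) - set(labeled)
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        idx = list(labeled)
        clf.fit(clip_embeddings[idx], [labeled[i] for i in idx])  # retrain lightweight classifier
        cand = np.array(sorted(pool))
        p = clf.predict_proba(clip_embeddings[cand])[:, 1]
        # Uncertainty sampling: queue the clips closest to the decision boundary.
        query = cand[np.argsort(np.abs(p - 0.5))[:batch_size]].tolist()
        labeled.update(request_labels(query))
        pool -= set(query)
    return clf, labeled
```

Each pass both grows the labeled dataset and refreshes the classifier, which is the sense in which data collection and model training are integrated into one continuous process.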
Related papers
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673] (arXiv, 2023-10-17)
  We argue that it is hard to judge large conditional generative models with simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
  Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
  We then evaluate state-of-the-art video generative models on our carefully designed benchmark in terms of visual quality, content quality, motion quality, and text-video alignment, using 17 well-selected objective metrics.
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296] (arXiv, 2023-08-15)
  We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
  We show that the model can act as a drop-in replacement for an ego-aware video model to improve performance through visual-text grounding.
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101] (arXiv, 2023-06-20)
  We propose a new task and model for dense video object captioning.
  This task unifies spatial and temporal localization in video.
  We show how our model improves upon a number of strong baselines for this new task.
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384] (arXiv, 2022-09-20)
  Deep learning models perform poorly when applied to videos with rare scenes or objects.
  We tackle this problem from two angles: algorithm and dataset.
  We show that the debiased representation generalizes better when transferred to other datasets and tasks.
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487] (arXiv, 2022-07-04)
  Transferring knowledge from task-agnostic pre-trained deep models to downstream tasks is an important topic in computer vision research.
  In this study, we focus on transferring knowledge for video classification tasks.
  We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
- CLUE: Contextualised Unified Explainable Learning of User Engagement in Video Lectures [6.25256391074865] (arXiv, 2022-01-14)
  We propose a new unified model, CLUE, which learns from features extracted from public online teaching videos.
  Our model exploits various multi-modal features to capture the complexity of language, context information, and the textual emotion of the delivered content.
- CoCon: Cooperative-Contrastive Learning [52.342936645996765] (arXiv, 2021-04-30)
  Self-supervised visual representation learning is key for efficient video analysis.
  Recent success in learning image representations suggests that contrastive learning is a promising framework for tackling this challenge.
  We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
- Distill on the Go: Online knowledge distillation in self-supervised learning [1.1470070927586016] (arXiv, 2021-04-20)
  Recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models.
  We propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm using single-stage online knowledge distillation.
  Our results show significant performance gains in the presence of noisy and limited labels.
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973] (arXiv, 2020-01-16)
  The cross-modal pair discrimination (CPD) framework captures the correlation between a video and its associated text.
  We train CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (300k) to demonstrate its effectiveness.
This list is automatically generated from the titles and abstracts of papers on this site.