AutoVideo: An Automated Video Action Recognition System
- URL: http://arxiv.org/abs/2108.04212v2
- Date: Tue, 10 Aug 2021 00:49:59 GMT
- Title: AutoVideo: An Automated Video Action Recognition System
- Authors: Daochen Zha, Zaid Pervaiz Bhat, Yi-Wei Chen, Yicheng Wang, Sirui Ding,
Anmoll Kumar Jain, Mohammad Qazim Bhat, Kwei-Herng Lai, Jiaben Chen, Na Zou,
Xia Hu
- Abstract summary: AutoVideo is a Python system for automated video action recognition.
It supports seven action recognition algorithms and various pre-processing modules.
It can be easily combined with AutoML searchers.
- Score: 38.39389521973502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action recognition is a crucial task for video understanding. In this paper,
we present AutoVideo, a Python system for automated video action recognition.
It currently supports seven action recognition algorithms and various
pre-processing modules. Unlike existing libraries that only provide model
zoos, AutoVideo is built with the standard pipeline language. The basic
building block is a primitive, which wraps a pre-processing module or an
algorithm together with its hyperparameters (sketched below). AutoVideo is
highly modular and extensible, and it can easily be combined with AutoML
searchers. The pipeline language is general enough that AutoVideo can be
enriched with algorithms for various other video-related tasks in the future.
AutoVideo is released under the MIT license at https://github.com/datamllab/autovideo
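To make the primitive/pipeline idea above concrete, here is a minimal sketch in plain Python. It is illustrative only: the class and function names below are assumptions for exposition, not AutoVideo's actual API (see the GitHub repository for that).

```python
# Illustrative sketch of the "primitive" idea: each primitive wraps a
# pre-processing step or an algorithm together with its hyperparameters,
# and a pipeline is an ordered chain of primitives.
# NOTE: names here are hypothetical, not AutoVideo's real interface.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Primitive:
    """Wraps a callable step with named hyperparameters."""
    name: str
    step: Callable[[Any, Dict[str, Any]], Any]
    hyperparams: Dict[str, Any] = field(default_factory=dict)

    def run(self, data: Any) -> Any:
        return self.step(data, self.hyperparams)


@dataclass
class Pipeline:
    """An ordered chain of primitives; each output feeds the next step."""
    primitives: List[Primitive]

    def run(self, data: Any) -> Any:
        for p in self.primitives:
            data = p.run(data)
        return data


# Toy steps standing in for frame extraction and a recognition algorithm.
def extract_frames(video: str, hp: Dict[str, Any]) -> List[str]:
    return [f"{video}:frame{i}" for i in range(hp["num_frames"])]


def classify(frames: List[str], hp: Dict[str, Any]) -> str:
    return f"action_predicted_from_{len(frames)}_frames@lr={hp['lr']}"


pipeline = Pipeline([
    Primitive("frame_extractor", extract_frames, {"num_frames": 8}),
    Primitive("action_classifier", classify, {"lr": 1e-3}),
])
print(pipeline.run("demo.avi"))
```

Because every hyperparameter is an explicit field on a primitive, an AutoML searcher can tune a pipeline simply by rewriting those dictionaries, which is the modularity the abstract highlights.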
Related papers
- Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs [20.168429351519055]
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
VideoNIAH decouples test video content from query-responses by inserting unrelated image/text 'needles' into original videos.
It generates annotations solely from these needles, ensuring diverse video sources and varied query-responses (toy sketch below).
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
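As a toy illustration of the needle idea (generic Python, not VideoNIAH's implementation; all names are hypothetical): a synthetic needle is stamped into a host video, and the annotation is derived from the needle alone, so the ground truth never depends on the host video's content.

```python
# Toy sketch of needle-in-a-haystack benchmark construction (illustrative
# only, not VideoNIAH's actual code). A synthetic "needle" is inserted into
# a host video; the QA annotation comes solely from the needle, so ground
# truth is known by construction and requires no human labeling.
import random

def insert_needle(frames, needle_text, rng):
    """Overlay a text needle on one randomly chosen frame; return the
    edited frames plus an annotation generated solely from the needle."""
    idx = rng.randrange(len(frames))
    edited = list(frames)
    edited[idx] = f"{frames[idx]}+[needle:{needle_text}]"
    annotation = {
        "question": "What word was shown in the inserted text overlay?",
        "answer": needle_text,
        "needle_frame": idx,  # known by construction
    }
    return edited, annotation

rng = random.Random(0)
host_video = [f"frame{i}" for i in range(32)]  # stand-in for decoded frames
video, ann = insert_needle(host_video, "zebra", rng)
print(ann)
```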
- Step Differences in Instructional Video [34.551572600535565]
We propose an approach that generates visual instruction-tuning data from pairs of HowTo100M videos.
We then train a video-conditioned language model to jointly reason across multiple raw videos.
Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos.
arXiv Detail & Related papers (2024-04-24T21:49:59Z)
- Streaming Dense Video Captioning [85.70265343236687]
An ideal model for dense video captioning should be able to handle long input videos and predict rich, detailed textual descriptions.
Current state-of-the-art models process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video.
We propose a streaming dense video captioning model that consists of two novel components.
arXiv Detail & Related papers (2024-04-01T17:59:15Z)
- Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Frame quality deterioration is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain spatio-temporal information.
We present a neat and unified framework called the Spatio-Temporal Prompting Network (STPN).
It can efficiently extract video features by adjusting the input features in the network backbone.
arXiv Detail & Related papers (2024-02-04T17:52:04Z)
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [59.525108086957296]
Video-ChatGPT is a multimodal model that merges a video-adapted visual encoder with an LLM.
It is capable of understanding and generating detailed conversations about videos.
We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT.
arXiv Detail & Related papers (2023-06-08T17:59:56Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present InternVideo, general video foundation models for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- PyTorchVideo: A Deep Learning Library for Video Understanding [71.89124881732015]
PyTorchVideo is an open-source deep-learning library for video understanding tasks.
It covers a full stack of video understanding tools including multimodal data loading, transformations, and models.
The library is based on PyTorch and can be used with any training framework (a minimal loading sketch follows below).
arXiv Detail & Related papers (2021-11-18T18:59:58Z)
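As a pointer for readers, PyTorchVideo's pretrained models are exposed through Torch Hub. The snippet below is a minimal sketch assuming `torch` and `pytorchvideo` are installed, with the `slow_r50` entry point taken from PyTorchVideo's public model zoo.

```python
# Minimal sketch: load a pretrained video model via Torch Hub and run a
# dummy clip through it. Assumes torch and pytorchvideo are installed and
# network access is available to fetch the weights.
import torch

model = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=True)
model = model.eval()

# A fake clip with the expected layout: batch x channels x frames x H x W.
clip = torch.randn(1, 3, 8, 224, 224)
with torch.no_grad():
    logits = model(clip)
print(logits.shape)  # (1, 400) -- Kinetics-400 class logits
```

For real inference the clip would also need the model's sampling and normalization transforms (available in `pytorchvideo.transforms`); the random tensor here only demonstrates the expected input layout.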