VideoLT: Large-scale Long-tailed Video Recognition
- URL: http://arxiv.org/abs/2105.02668v1
- Date: Thu, 6 May 2021 13:47:44 GMT
- Title: VideoLT: Large-scale Long-tailed Video Recognition
- Authors: Xing Zhang, Zuxuan Wu, Zejia Weng, Huazhu Fu, Jingjing Chen, Yu-Gang
Jiang, Larry Davis
- Abstract summary: We introduce VideoLT, a large-scale long-tailed video recognition dataset.
Our VideoLT contains 256,218 untrimmed videos, annotated into 1,004 classes with a long-tailed distribution.
We propose FrameStack, a simple yet effective method for the long-tailed video recognition task.
- Score: 100.15503884988736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Label distributions in the real world are often long-tailed and
imbalanced, resulting in models biased toward dominant labels. While
long-tailed recognition has been extensively studied for image classification
tasks, limited effort has been made for the video domain. In this paper, we
introduce VideoLT, a large-scale long-tailed video recognition dataset, as a
step toward real-world video recognition. Our VideoLT contains 256,218
untrimmed videos, annotated into 1,004 classes with a long-tailed
distribution. Through extensive studies, we demonstrate that state-of-the-art
methods used for long-tailed image recognition do not perform well in the
video domain due to the additional temporal dimension in video data. This
motivates us to propose FrameStack, a simple yet effective method for the
long-tailed video recognition task. In particular, FrameStack performs
sampling at the frame level to balance class distributions, and the sampling
ratio is dynamically determined using knowledge derived from the network
during training. Experimental results demonstrate that FrameStack can improve
classification performance on tail classes without sacrificing overall
accuracy.
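The abstract describes FrameStack only at a high level; as illustration, here is a minimal Python sketch of frame-level resampling with a dynamically chosen mixing ratio. It is not the authors' implementation: using a per-class running average precision array (`class_ap`) as the "knowledge derived from the network" is an assumption, as are all names and shapes.

```python
import numpy as np

def framestack(frames_a, label_a, frames_b, label_b, class_ap, seq_len=16, rng=None):
    """Temporal mix of two training clips: the clip whose classes the network
    currently handles worse (lower running AP) contributes more frames.
    frames_*: (T, H, W, C) arrays; label_*: multi-hot (num_classes,) vectors."""
    rng = rng or np.random.default_rng()
    # "Knowledge derived from the network": mean running AP over each clip's
    # labels (class_ap is an assumed stand-in for that knowledge).
    ap_a = float(np.mean(class_ap[label_a > 0]))
    ap_b = float(np.mean(class_ap[label_b > 0]))
    # Lower AP -> larger share of the output sequence.
    share_a = (1.0 - ap_a) / max((1.0 - ap_a) + (1.0 - ap_b), 1e-8)
    n_a = int(round(seq_len * share_a))
    n_b = seq_len - n_a

    def take(frames, n):
        idx = rng.choice(len(frames), size=n, replace=len(frames) < n)
        return frames[np.sort(idx)]          # keep temporal order within a clip

    mixed = np.concatenate([take(frames_a, n_a), take(frames_b, n_b)])
    # Labels are mixed in proportion to the frame budget each clip received.
    target = (n_a / seq_len) * label_a + (n_b / seq_len) * label_b
    return mixed, target
```

The label mixing mirrors CutMix-style targets along the temporal axis, so classes the network currently handles poorly receive more frames per training sequence without discarding head-class videos.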
Related papers
- VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges [42.555895949250704]
VideoLLaMB is a novel framework that uses temporal memory tokens within bridge layers to encode entire video sequences.
The SceneTilling algorithm segments videos into independent semantic units to preserve semantic integrity.
In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU.
arXiv Detail & Related papers (2024-09-02T08:52:58Z)
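The VideoLLaMB summary above names temporal memory tokens within bridge layers but not their form; the following is a loose, hypothetical PyTorch sketch of the general recurrent-memory idea (a fixed token budget updated segment by segment), not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class RecurrentMemoryBridge(nn.Module):
    """Hypothetical sketch: a bridge layer carries a fixed budget of memory
    tokens across video segments, so long videos are encoded segment by
    segment at bounded cost. All sizes are illustrative."""
    def __init__(self, dim=512, num_memory=8, num_heads=8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(1, num_memory, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, segment_tokens, memory=None):
        # segment_tokens: (B, T, dim) features of one semantic unit.
        if memory is None:
            memory = self.memory.expand(segment_tokens.size(0), -1, -1)
        # Memory queries attend over [memory; segment] and become the new memory.
        context = torch.cat([memory, segment_tokens], dim=1)
        updated, _ = self.attn(memory, context, context)
        return self.norm(memory + updated)

# Usage: stream segments (e.g. from a SceneTilling-style splitter) through the bridge.
bridge = RecurrentMemoryBridge()
memory = None
for segment in torch.randn(4, 3, 16, 512).unbind(1):  # 3 segments of 16 tokens each
    memory = bridge(segment, memory)
```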
- Active Learning for Video Classification with Frame Level Queries [13.135234328352885]
We propose a novel active learning framework for video classification.
Our framework identifies a batch of exemplar videos, together with a set of informative frames for each video.
Annotating these frames involves far less manual work than watching the complete video to assign a label.
arXiv Detail & Related papers (2023-07-10T15:47:13Z)
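The entry above describes selecting exemplar videos plus informative frames within them; a plausible (assumed, not the paper's) selection rule is entropy-based uncertainty sampling at both levels:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of softmax rows; higher means more uncertain."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def select_queries(frame_probs_per_video, n_videos=2, n_frames=3):
    """Hypothetical two-level query selection: pick the videos whose
    frame-level predictions are most uncertain on average, then the most
    uncertain frames inside each for human annotation.
    frame_probs_per_video: list of (T_i, num_classes) softmax arrays."""
    video_scores = [entropy(p).mean() for p in frame_probs_per_video]
    exemplar_ids = np.argsort(video_scores)[::-1][:n_videos]
    queries = {}
    for vid in exemplar_ids:
        frame_scores = entropy(frame_probs_per_video[vid])
        queries[int(vid)] = np.argsort(frame_scores)[::-1][:n_frames].tolist()
    return queries  # {video_id: [frame indices to annotate]}
```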
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification while also offering faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
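The Frozen CLIP entry above argues for training video models without fine-tuning the visual backbone; a generic sketch of that recipe (not the paper's exact EVL design) keeps a frozen per-frame encoder and trains only a small temporal head:

```python
import torch
import torch.nn as nn

class FrozenBackboneVideoClassifier(nn.Module):
    """Generic sketch: a frozen image encoder (e.g. CLIP's visual tower,
    assumed to map (N, C, H, W) -> (N, feat_dim)) runs per frame; only the
    lightweight temporal transformer and classifier head are trained."""
    def __init__(self, image_encoder, feat_dim=512, num_classes=400):
        super().__init__()
        self.backbone = image_encoder.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)          # keep the backbone frozen
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, video):                # video: (B, T, C, H, W)
        B, T = video.shape[:2]
        with torch.no_grad():                # frame features, no backbone grads
            feats = self.backbone(video.flatten(0, 1)).view(B, T, -1)
        return self.head(self.temporal(feats).mean(dim=1))
```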
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by recent progress in self-supervised learning, we present a novel sequence contrastive loss (SCL) applied to two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
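The SCL entry above applies a sequence contrastive loss to two correlated views of the same video; below is a simplified per-frame InfoNCE stand-in (the paper's actual loss also accounts for timestamp distance between frames):

```python
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(z1, z2, temperature=0.1):
    """Simplified stand-in for SCL: symmetric per-frame InfoNCE between two
    augmented views. z1, z2: (T, D) frame embeddings of the two views."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature       # (T, T) cross-view similarities
    targets = torch.arange(z1.size(0))       # frame t should match frame t
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```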
- Multiview Pseudo-Labeling for Semi-supervised Learning from Video [102.36355560553402]
We present a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.
Our method capitalizes on multiple views, but it nonetheless trains a model that is shared across appearance and motion input.
On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.
arXiv Detail & Related papers (2021-04-01T17:59:48Z)
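The entry above trains one shared model on complementary appearance and motion views; a hedged sketch of cross-view pseudo-labeling (assumed details: softmax confidence thresholding, RGB and optical-flow inputs) looks like this:

```python
import torch
import torch.nn.functional as F

def multiview_pseudo_label_loss(model, rgb_clip, flow_clip, threshold=0.9):
    """Hypothetical cross-view rule: each view's confident prediction on an
    unlabeled clip pseudo-labels the other view of the same clip.
    rgb_clip, flow_clip: batched inputs the shared model accepts."""
    with torch.no_grad():                    # pseudo-labels carry no gradient
        conf_rgb, y_rgb = F.softmax(model(rgb_clip), dim=-1).max(dim=-1)
        conf_flow, y_flow = F.softmax(model(flow_clip), dim=-1).max(dim=-1)
    loss = rgb_clip.new_zeros(())
    mask = conf_flow > threshold             # motion view supervises appearance
    if mask.any():
        loss = loss + F.cross_entropy(model(rgb_clip)[mask], y_flow[mask])
    mask = conf_rgb > threshold              # and vice versa
    if mask.any():
        loss = loss + F.cross_entropy(model(flow_clip)[mask], y_rgb[mask])
    return loss
```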
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
Our approach considerably outperforms video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
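The two-pathway entry above contrasts the same unlabeled video at two playback speeds; a minimal sketch of that setup (strides, clip length, and the instance-level InfoNCE are assumptions) follows:

```python
import torch
import torch.nn.functional as F

def two_speed_views(video, slow_stride=2, fast_stride=4, length=8):
    """Sample the same video at two playback speeds (assumes the clip is
    long enough for both strides). video: (T, C, H, W)."""
    return video[::slow_stride][:length], video[::fast_stride][:length]

def speed_contrastive_loss(encoder, batch, temperature=0.5):
    """Instance-level InfoNCE across speeds: each video's slow view should
    match its own fast view, not those of other videos. The encoder is
    assumed to map (B, length, C, H, W) -> (B, D). batch: (B, T, C, H, W)."""
    views = [two_speed_views(v) for v in batch]
    z_slow = F.normalize(encoder(torch.stack([s for s, _ in views])), dim=-1)
    z_fast = F.normalize(encoder(torch.stack([f for _, f in views])), dim=-1)
    logits = z_slow @ z_fast.t() / temperature
    return F.cross_entropy(logits, torch.arange(len(batch)))
```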
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
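The TCA entry above aggregates frame-level features with long-range temporal context; a minimal self-attention pooling sketch of that idea (sizes and the mean-pooling readout are assumptions) is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalContextAggregation(nn.Module):
    """Sketch of the TCA idea: self-attention over frame-level features
    captures long-range temporal relations before pooling them into one
    L2-normalized video-level descriptor for retrieval."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats):          # frame_feats: (B, T, dim)
        ctx, _ = self.attn(frame_feats, frame_feats, frame_feats)
        ctx = self.norm(frame_feats + ctx)   # residual + layer norm
        return F.normalize(ctx.mean(dim=1), dim=-1)  # (B, dim)
```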