Few-Shot Video Object Detection
- URL: http://arxiv.org/abs/2104.14805v1
- Date: Fri, 30 Apr 2021 07:38:04 GMT
- Title: Few-Shot Video Object Detection
- Authors: Qi Fan, Chi-Keung Tang, Yu-Wing Tai
- Abstract summary: We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions.
FSVOD-500 comprises 500 classes with class-balanced videos in each category for few-shot learning.
Our TPN and TMN+ are jointly trained end-to-end.
- Score: 70.43402912344327
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce Few-Shot Video Object Detection (FSVOD) with three important
contributions: 1) a large-scale video dataset, FSVOD-500, comprising 500
classes with class-balanced videos in each category for few-shot learning; 2) a
novel Tube Proposal Network (TPN) to generate high-quality video tube proposals
that aggregate feature representations for the target video object; 3) a
strategically improved Temporal Matching Network (TMN+) that matches representative
query tube features against support features with better discriminative ability. Our TPN
and TMN+ are jointly trained end-to-end. Extensive experiments demonstrate
that our method produces significantly better detection results on two few-shot
video object detection datasets than image-based methods and other naive
video-based extensions. Code and datasets will be released at
https://github.com/fanq15/FewX.
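The abstract describes a two-stage matching pipeline: the TPN produces tube proposals whose per-frame features are aggregated into tube-level representations, and the TMN+ matches those query tubes against few-shot support features. Below is a minimal PyTorch sketch of that query-support matching step; the module name TubeMatcher, the feature dimension, the mean-pooling aggregation, and the cosine-similarity scoring head are illustrative assumptions, not the paper's exact TMN+ design.

```python
import torch
import torch.nn as nn

class TubeMatcher(nn.Module):
    """Hypothetical sketch of tube-level query-support matching."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Shared projection for query tubes and support examples (assumed).
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, tube_feats: torch.Tensor, support_feats: torch.Tensor) -> torch.Tensor:
        """tube_feats: (num_tubes, T, D) per-frame features of each tube proposal.
        support_feats: (K, D) features of the K support examples of one class.
        Returns a (num_tubes,) matching score against that class."""
        # Aggregate each tube over time into a single representation.
        tube_repr = self.proj(tube_feats.mean(dim=1))   # (num_tubes, D)
        # Aggregate the K-shot supports into one class prototype.
        proto = self.proj(support_feats).mean(dim=0)    # (D,)
        # Cosine similarity as the matching score.
        tube_repr = nn.functional.normalize(tube_repr, dim=-1)
        proto = nn.functional.normalize(proto, dim=-1)
        return tube_repr @ proto

matcher = TubeMatcher(feat_dim=256)
scores = matcher(torch.randn(8, 5, 256), torch.randn(3, 256))  # 8 tubes, 5 frames, 3-shot
print(scores.shape)  # torch.Size([8])
```

In a full system, scores like these would rank tube proposals per novel class; the paper's TMN+ replaces this naive head with a learned, more discriminative matching network.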
Related papers
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z) - Towards Scalable Neural Representation for Diverse Videos [68.73612099741956]
Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images.
Existing INR-based methods are limited to encoding a handful of short videos with redundant visual content.
This paper focuses on developing neural representations for encoding long and/or a large number of videos with diverse visual content.
arXiv Detail & Related papers (2023-03-24T16:32:19Z) - Class-attention Video Transformer for Engagement Intensity Prediction [20.430266245901684]
CavT is a method that uniformly performs end-to-end learning on variable-length long videos and fixed-length short videos.
CavT achieves the state-of-the-art MSE (0.0495) on the EmotiW-EP dataset, and the state-of-the-art MSE (0.0377) on the DAiSEE dataset.
arXiv Detail & Related papers (2022-08-12T01:21:30Z) - Boosting Video Representation Learning with Multi-Faceted Integration [112.66127428372089]
Video content is multifaceted, consisting of objects, scenes, interactions or actions.
Existing datasets mostly label only one of the facets for model training, resulting in video representations biased toward a single facet, depending on the training dataset.
We propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content.
arXiv Detail & Related papers (2022-01-11T16:14:23Z) - VALUE: A Multi-Task Benchmark for Video-and-Language Understanding
Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z) - Few-Shot Learning for Video Object Detection in a Transfer-Learning
Scheme [70.45901040613015]
We study the new problem of few-shot learning for video object detection.
We employ a transfer-learning framework to effectively train the video object detector on a large number of base-class objects and a few video clips of novel-class objects.
arXiv Detail & Related papers (2021-03-26T20:37:55Z) - Self-supervised Video Representation Learning Using Inter-intra
Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
To exploit temporal structure, we extend the negative samples by introducing intra-negative samples, generated by breaking the temporal relations within a video clip (see the sketch after this list).
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.