Rethinking Zero-shot Video Classification: End-to-end Training for
Realistic Applications
- URL: http://arxiv.org/abs/2003.01455v4
- Date: Sat, 20 Jun 2020 08:22:45 GMT
- Title: Rethinking Zero-shot Video Classification: End-to-end Training for
Realistic Applications
- Authors: Biagio Brattoli, Joseph Tighe, Fedor Zhdanov, Pietro Perona, Krzysztof
Chalupka
- Abstract summary: Zero-shot learning (ZSL) trains a model once and generalizes to new tasks whose classes are not present in the training dataset.
We propose the first end-to-end algorithm for ZSL in video classification.
Our training procedure builds on insights from recent video classification literature and uses a trainable 3D CNN to learn the visual features.
- Score: 26.955001807330497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Trained on large datasets, deep learning (DL) can accurately classify videos
into hundreds of diverse classes. However, video data is expensive to annotate.
Zero-shot learning (ZSL) proposes one solution to this problem. ZSL trains a
model once, and generalizes to new tasks whose classes are not present in the
training dataset. We propose the first end-to-end algorithm for ZSL in video
classification. Our training procedure builds on insights from recent video
classification literature and uses a trainable 3D CNN to learn the visual
features. This is in contrast to previous video ZSL methods, which use
pretrained feature extractors. We also extend the current benchmarking
paradigm: Previous techniques aim to make the test task unknown at training
time but fall short of this goal. We encourage domain shift across training and
test data and disallow tailoring a ZSL model to a specific test dataset. We
outperform the state-of-the-art by a wide margin. Our code, evaluation
procedure and model weights are available at
github.com/bbrattoli/ZeroShotVideoClassification.
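The end-to-end recipe described in the abstract lends itself to a compact sketch. The snippet below is a minimal illustration, assuming a PyTorch/torchvision R(2+1)D-18 backbone, 300-dimensional word embeddings of class names (e.g., Word2Vec), an MSE regression loss, and cosine nearest-neighbor inference over unseen-class embeddings; these choices are illustrative stand-ins, not necessarily the authors' exact configuration.

```python
# Minimal sketch of end-to-end zero-shot video classification.
# Assumptions (not the authors' exact recipe): torchvision's R(2+1)D-18 as the
# trainable 3D CNN, 300-d word embeddings for class names, MSE regression loss,
# and cosine nearest-neighbor matching against unseen-class embeddings at test time.
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18


class EndToEndZSLVideo(nn.Module):
    def __init__(self, embed_dim: int = 300):
        super().__init__()
        backbone = r2plus1d_18()  # trainable 3D CNN, no frozen pretrained feature extractor
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 3, frames, height, width) -> (batch, embed_dim)
        return self.backbone(clips)


def training_step(model, optimizer, clips, label_idx, class_embeddings):
    """Regress each clip's visual feature onto the word embedding of its class name.
    Gradients flow all the way into the 3D CNN (the end-to-end part)."""
    optimizer.zero_grad()
    pred = model(clips)
    loss = nn.functional.mse_loss(pred, class_embeddings[label_idx])
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def zero_shot_predict(model, clips, unseen_class_embeddings):
    """Classify clips by nearest unseen-class embedding (cosine similarity)."""
    feats = nn.functional.normalize(model(clips), dim=-1)
    protos = nn.functional.normalize(unseen_class_embeddings, dim=-1)
    return (feats @ protos.T).argmax(dim=-1)
```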
Related papers
- Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple: we randomly drop input video patches and mask out input text during the post-pretraining procedure.
Our method achieves state-of-the-art performances, which are comparable to some heavily pretrained video foundation models.
arXiv Detail & Related papers (2023-10-30T14:06:16Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Self-Supervised Video Similarity Learning [35.512588398849395]
We introduce S$^2$VS, a video similarity learning approach with self-supervision.
We learn a single universal model that achieves state-of-the-art performance on all tasks.
arXiv Detail & Related papers (2023-04-06T21:15:27Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training open a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future work on advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Dynamic VAEs with Generative Replay for Continual Zero-shot Learning [1.90365714903665]
This paper proposes a novel continual zero-shot learning (DVGR-CZSL) model that grows in size with each task and uses generative replay to update itself with previously learned classes to avoid forgetting.
We show that our method is superior in sequential task learning with ZSL (Zero-Shot Learning).
arXiv Detail & Related papers (2021-04-26T10:56:43Z)
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
The canonical approach to video-and-language learning dictates that a neural model learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z)
- Generative Replay-based Continual Zero-Shot Learning [7.909034037183046]
We develop a generative replay-based continual ZSL (GRCZSL) method.
The proposed method enables traditional ZSL to learn from streaming data and acquire new knowledge without forgetting previous tasks' experience.
The proposed GRCZSL method is developed for a single-head setting of continual learning, simulating a real-world problem setting.
arXiv Detail & Related papers (2021-01-22T00:03:34Z)
- Curriculum Learning for Recurrent Video Object Segmentation [2.3376061255029064]
This work explores different scheduled sampling and frame skipping variations to significantly improve the performance of a recurrent architecture; a generic sketch of the forward vs. inverse schedule appears after this list.
Our results on the car class of the KITTI-MOTS challenge indicate that, surprisingly, inverse scheduled sampling is a better option than the classic forward one.
arXiv Detail & Related papers (2020-08-15T10:51:22Z)
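As a rough illustration of the forward vs. inverse scheduled sampling mentioned in the last entry, the sketch below uses a generic linear schedule for a recurrent model; the schedule shape and the helper functions are assumptions for illustration, not the paper's actual segmentation pipeline.

```python
# Generic scheduled-sampling sketch for a recurrent video model.
# Assumption: a linear schedule over epochs; this only illustrates the
# forward vs. inverse idea, not the KITTI-MOTS setup from the paper.
import random


def ground_truth_prob(epoch: int, total_epochs: int, inverse: bool = False) -> float:
    """Probability of feeding the ground-truth previous mask instead of the
    model's own prediction at the current step."""
    progress = epoch / max(total_epochs - 1, 1)
    # Classic (forward) schedule: start with ground truth, move toward predictions.
    # Inverse schedule: start with the model's predictions, move toward ground truth.
    return progress if inverse else 1.0 - progress


def pick_previous_input(gt_prev, pred_prev, epoch, total_epochs, inverse=False):
    """Choose which previous-frame signal the recurrent model receives."""
    if random.random() < ground_truth_prob(epoch, total_epochs, inverse):
        return gt_prev
    return pred_prev
```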