Less is More: ClipBERT for Video-and-Language Learning via Sparse
Sampling
- URL: http://arxiv.org/abs/2102.06183v1
- Date: Thu, 11 Feb 2021 18:50:16 GMT
- Title: Less is More: ClipBERT for Video-and-Language Learning via Sparse
Sampling
- Authors: Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit
Bansal, Jingjing Liu
- Abstract summary: The canonical approach to video-and-language learning dictates that a neural model learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering across six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos.
- Score: 98.41300980759577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The canonical approach to video-and-language learning (e.g., video question
answering) dictates that a neural model learn from offline-extracted dense video
features from vision models and text features from language models. These
feature extractors are trained independently and usually on tasks different
from the target domains, rendering these fixed features sub-optimal for
downstream tasks. Moreover, due to the high computational overhead of dense
video features, it is often difficult (or infeasible) to plug feature
extractors directly into existing approaches for easy finetuning. To provide a
remedy to this dilemma, we propose a generic framework ClipBERT that enables
affordable end-to-end learning for video-and-language tasks, by employing
sparse sampling, where only a single or a few sparsely sampled short clips from
a video are used at each training step. Experiments on text-to-video retrieval
and video question answering on six datasets demonstrate that ClipBERT
outperforms (or is on par with) existing methods that exploit full-length
videos, suggesting that end-to-end learning with just a few sparsely sampled
clips is often more accurate than using densely extracted offline features from
full-length videos, proving the proverbial less-is-more principle. Videos in
the datasets are from considerably different domains and lengths, ranging from
3-second generic domain GIF videos to 180-second YouTube human activity videos,
showing the generalization ability of our approach. Comprehensive ablation
studies and thorough analyses are provided to dissect what factors lead to this
success. Our code is publicly available at https://github.com/jayleicn/ClipBERT
Related papers
- Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos.
Our model uses a novel autoregressive factorized decoding architecture.
Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
arXiv Detail & Related papers (2024-11-22T02:46:44Z) - InternVideo: General Video Foundation Models via Generative and
Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z) - Enabling Weakly-Supervised Temporal Action Localization from On-Device
Learning of the Video Stream [5.215681853828831]
We propose an efficient video learning approach to learn from a long, untrimmed streaming video.
To the best of our knowledge, this is the first attempt to learn directly from a long, on-device video stream.
arXiv Detail & Related papers (2022-08-25T13:41:03Z) - Revealing Single Frame Bias for Video-and-Language Learning [115.01000652123882]
We show that a single-frame trained model can achieve better performance than existing methods that use multiple frames for training.
This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets.
We propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling.
arXiv Detail & Related papers (2022-06-07T16:28:30Z) - Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement
Learning Method [6.172652648945223]
This paper presents a novel weakly-supervised methodology to accelerate instructional videos using text.
A novel joint reward function guides our agent to select which frames to remove and reduce the input video to a target length.
We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space.
arXiv Detail & Related papers (2022-03-29T17:43:01Z) - VALUE: A Multi-Task Benchmark for Video-and-Language Understanding
Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z) - Skimming and Scanning for Untrimmed Video Action Recognition [44.70501912319826]
Untrimmed videos have redundant and diverse clips containing contextual information.
We propose a simple yet effective clip-level solution based on skim-scan techniques.
Our solution surpasses the state-of-the-art performance in terms of both accuracy and efficiency.
arXiv Detail & Related papers (2021-04-21T12:23:44Z) - Beyond Short Clips: End-to-End Video-Level Learning with Collaborative
Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)