Beyond Instructional Videos: Probing for More Diverse Visual-Textual
Grounding on YouTube
- URL: http://arxiv.org/abs/2004.14338v2
- Date: Fri, 16 Oct 2020 17:30:51 GMT
- Title: Beyond Instructional Videos: Probing for More Diverse Visual-Textual
Grounding on YouTube
- Authors: Jack Hessel, Zhenhai Zhu, Bo Pang, Radu Soricut
- Abstract summary: We show that visual-textual grounding is possible across previously unexplored video categories.
We find that pretraining on a more diverse set results in representations that generalize to both non-instructional and instructional domains.
- Score: 35.32213834577941
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretraining from unlabelled web videos has quickly become the de-facto means
of achieving high performance on many video understanding tasks. Features are
learned via prediction of grounded relationships between visual content and
automatic speech recognition (ASR) tokens. However, prior pretraining work has
been limited to only instructional videos; a priori, we expect this domain to
be relatively "easy:" speakers in instructional videos will often reference the
literal objects/actions being depicted. We ask: can similar models be trained
on more diverse video corpora? And, if so, what types of videos are "grounded"
and what types are not? We fit a representative pretraining model to the
diverse YouTube8M dataset, and study its success and failure cases. We find
that visual-textual grounding is indeed possible across previously unexplored
video categories, and that pretraining on a more diverse set results in
representations that generalize to both non-instructional and instructional
domains.
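The abstract describes features learned by predicting grounded relationships between visual content and ASR tokens. As a rough illustration only (not the paper's actual pretraining model), the sketch below pairs pooled frame features with pooled ASR-token embeddings under a symmetric contrastive loss; the module names, dimensions, and loss choice are assumptions made for this sketch.

```python
# Minimal sketch of a visual-textual grounding objective: align clip features
# with ASR-token features via a symmetric InfoNCE-style contrastive loss.
# All names and hyperparameters below are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundingModel(nn.Module):
    def __init__(self, frame_dim=2048, vocab_size=30000, embed_dim=256):
        super().__init__()
        self.visual_proj = nn.Linear(frame_dim, embed_dim)         # project pooled frame features
        self.token_embed = nn.EmbeddingBag(vocab_size, embed_dim)  # mean-pool ASR token embeddings
        self.temperature = 0.07

    def forward(self, frame_feats, asr_tokens):
        # frame_feats: (batch, num_frames, frame_dim) precomputed visual features
        # asr_tokens:  (batch, num_tokens) ASR token ids for the same clips
        v = F.normalize(self.visual_proj(frame_feats.mean(dim=1)), dim=-1)
        t = F.normalize(self.token_embed(asr_tokens), dim=-1)
        logits = v @ t.T / self.temperature               # (batch, batch) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        # symmetric loss: match each clip to its own ASR segment and vice versa
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))

# Toy usage with random tensors standing in for real video/ASR batches.
model = GroundingModel()
loss = model(torch.randn(8, 16, 2048), torch.randint(0, 30000, (8, 32)))
loss.backward()
```

In an actual pretraining recipe the encoders, negative sampling, and clip/ASR segment alignment would differ; the sketch only conveys the general shape of a grounding objective over video features and ASR tokens.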
Related papers
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets spanning a wide range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- Learning Transferable Spatiotemporal Representations from Natural Script Knowledge [65.40899722211726]
We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts transcripts by attending to learned video representations.
This enables our model to contextualize what is happening the way humans do and to apply seamlessly to large-scale, uncurated video data in the real world.
arXiv Detail & Related papers (2022-09-30T07:39:48Z)
- Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
A canonical approach to video-and-language learning dictates that a neural model learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Self-supervised Video Representation Learning by Pace Prediction [48.029602040786685]
This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction.
It stems from the observation that the human visual system is sensitive to video pace.
We randomly sample training clips at different paces and ask a neural network to identify the pace of each clip (see the sketch after this list).
arXiv Detail & Related papers (2020-08-13T12:40:24Z)
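The pace-prediction pretext task summarized in the last entry can be sketched roughly as follows: clips are sampled at different frame strides ("paces") and a network is trained to classify which pace was used. The tiny backbone, pace set, and clip length below are placeholders and not the paper's implementation.

```python
# Illustrative sketch of pace prediction as a self-supervised pretext task.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

PACES = [1, 2, 4, 8]          # frame-sampling strides treated as class labels
CLIP_LEN = 16                 # frames per training clip

def sample_clip_with_pace(video_frames):
    """video_frames: (total_frames, C, H, W). Returns a clip and its pace label."""
    label = random.randrange(len(PACES))
    stride = PACES[label]
    max_start = video_frames.size(0) - CLIP_LEN * stride
    start = random.randrange(max(1, max_start))
    idx = torch.arange(start, start + CLIP_LEN * stride, stride)
    return video_frames[idx], label

class PacePredictor(nn.Module):
    """Tiny stand-in for a 3D-CNN backbone followed by a pace classifier head."""
    def __init__(self, channels=3, num_paces=len(PACES)):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # global spatiotemporal pooling
        )
        self.head = nn.Linear(16, num_paces)

    def forward(self, clip):           # clip: (batch, C, T, H, W)
        feat = self.backbone(clip).flatten(1)
        return self.head(feat)

# Toy training step on a random "video" of 256 frames of 32x32 RGB.
video = torch.randn(256, 3, 32, 32)
clip, label = sample_clip_with_pace(video)
clip = clip.permute(1, 0, 2, 3).unsqueeze(0)      # -> (1, C, T, H, W)
logits = PacePredictor()(clip)
loss = F.cross_entropy(logits, torch.tensor([label]))
loss.backward()
```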
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.