Boosting Video Representation Learning with Multi-Faceted Integration
- URL: http://arxiv.org/abs/2201.04023v1
- Date: Tue, 11 Jan 2022 16:14:23 GMT
- Title: Boosting Video Representation Learning with Multi-Faceted Integration
- Authors: Zhaofan Qiu and Ting Yao and Chong-Wah Ngo and Xiao-Ping Zhang and
Dong Wu and Tao Mei
- Abstract summary: Video content is multifaceted, consisting of objects, scenes, interactions or actions.
Existing datasets mostly label only one of these facets for model training, so the learned representation is biased toward a single facet depending on the training dataset.
We propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content.
- Score: 112.66127428372089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video content is multifaceted, consisting of objects, scenes, interactions or
actions. Existing datasets mostly label only one of these facets for model
training, so the learned video representation is biased toward a single facet
depending on the training dataset. There is not yet any study of how to learn a
video representation from multifaceted labels, or of whether multifaceted
information is helpful for video representation learning. In this paper, we
propose a new learning framework, MUlti-Faceted Integration (MUFI), to
aggregate facets from different datasets for learning a representation that
could reflect the full spectrum of video content. Technically, MUFI formulates
the problem as visual-semantic embedding learning, which explicitly maps video
representation into a rich semantic embedding space, and jointly optimizes
video representation from two perspectives. The first capitalizes on
intra-facet supervision between each video and its own label descriptions; the
second predicts the "semantic representation" of each video from the facets of
other datasets as inter-facet supervision (see the sketch after the abstract).
Extensive experiments
demonstrate that learning a 3D CNN via our MUFI framework on a union of four
large-scale video datasets plus two image datasets leads to video
representations of superior capability. The 3D CNN pre-trained with MUFI also
shows clear improvements over other approaches on several downstream video
applications.
More remarkably, MUFI achieves 98.1%/80.9% on UCF101/HMDB51 for action
recognition and 101.5% in terms of CIDEr-D score on MSVD for video captioning.
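
As a concrete reading of the two perspectives above, here is a minimal PyTorch sketch of an intra-facet contrastive term plus an inter-facet regression term. Everything here (the function names, the projector head, the 0.5 weight) is an illustrative assumption, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def intra_facet_loss(video_emb, label_emb, temperature=0.07):
    """Contrastively align each video with its own label description."""
    v = F.normalize(video_emb, dim=-1)   # (B, D) video embeddings
    t = F.normalize(label_emb, dim=-1)   # (B, D) text embeddings of own labels
    logits = v @ t.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(v.size(0))    # matched pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

def inter_facet_loss(video_emb, other_facet_emb, projector):
    """Regress the semantic representation another facet assigns to the video."""
    pred = F.normalize(projector(video_emb), dim=-1)
    target = F.normalize(other_facet_emb, dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()  # mean cosine distance

# Hypothetical joint objective with an assumed trade-off weight:
# loss = intra_facet_loss(v, t) + 0.5 * inter_facet_loss(v, s, head)
```

Using a contrastive form for the intra-facet term and a regression form for the inter-facet term mirrors the abstract's distinction between aligning a video with its own labels and predicting another facet's semantic representation.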
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework that produces annotations across modalities for more than 27.1k hours of trailer videos.
Our dataset potentially paves the way for fine-grained training of large multimodal-language models.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition [23.031934558964473]
We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this unpaired multiview learning problem.
The key idea is to build cross-view pseudo-pairs and perform view-invariant alignment by leveraging the semantic information of videos (see the sketch after this entry).
Our method also outperforms multiple existing view-alignment methods under this more challenging scenario.
arXiv Detail & Related papers (2023-08-22T15:10:42Z)
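
A rough sketch of the pseudo-pairing idea above, under the assumption that each clip comes with a semantic (e.g., text) embedding: match unpaired clips across views by semantic similarity, then pull the matched visual features together. The threshold and all names are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def build_pseudo_pairs(sem_view_a, sem_view_b, threshold=0.6):
    """Return index pairs (i, j) whose semantic embeddings agree.

    sem_view_a: (Na, D) semantic embeddings of egocentric clips.
    sem_view_b: (Nb, D) semantic embeddings of third-person clips.
    """
    a = F.normalize(sem_view_a, dim=-1)
    b = F.normalize(sem_view_b, dim=-1)
    sim = a @ b.t()                        # (Na, Nb) cosine similarity
    best_sim, best_j = sim.max(dim=1)      # nearest cross-view clip
    keep = best_sim > threshold            # keep confident matches only
    return torch.nonzero(keep).squeeze(1), best_j[keep]

def view_invariant_alignment(feat_a, feat_b, idx_a, idx_b):
    """Pull matched cross-view visual features together (cosine loss)."""
    fa = F.normalize(feat_a[idx_a], dim=-1)
    fb = F.normalize(feat_b[idx_b], dim=-1)
    return (1.0 - (fa * fb).sum(dim=-1)).mean()
```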
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as its pretraining objectives (see the sketch after this entry).
InternVideo achieves state-of-the-art performance on 39 video datasets across extensive tasks, including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
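
A minimal sketch of how the two named objectives could be combined, assuming patch-level targets for the masked branch and paired captions for the contrastive branch; this is illustrative, not InternVideo's actual implementation.

```python
import torch
import torch.nn.functional as F

def masked_video_loss(pred_patches, target_patches, mask):
    """MAE-style reconstruction on masked spatiotemporal patches.

    pred_patches/target_patches: (B, N, P) predicted vs. raw patch pixels.
    mask: (B, N) bool, True where a patch was masked out.
    """
    per_patch = (pred_patches - target_patches).pow(2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

def video_text_contrastive(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE between video and caption embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature
    labels = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Assumed combined objective:
# total = masked_video_loss(p, x, m) + video_text_contrastive(v, t)
```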
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
We propose the Motion-Contrastive Perception Network (MCPNet), which consists of two branches: Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future work on advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
We extend the set of negative samples by introducing intra-negative samples drawn from the same video (see the sketch after this entry).
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
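
One common way to realize intra-negatives, sketched here under the assumption that they are built by shuffling a clip's frame order to break temporal structure (the paper may use a different construction):

```python
import torch
import torch.nn.functional as F

def make_intra_negative(clip):
    """Permute the frames of a clip (B, T, C, H, W) to break temporal order."""
    perm = torch.randperm(clip.size(1))
    return clip[:, perm]

def inter_intra_nce(anchor, positive, intra_neg, temperature=0.1):
    """InfoNCE where each sample's intra-negative joins the denominator."""
    a = F.normalize(anchor, dim=-1)     # (B, D) clip embeddings
    p = F.normalize(positive, dim=-1)   # (B, D) another view of the same clips
    n = F.normalize(intra_neg, dim=-1)  # (B, D) embeddings of shuffled clips
    inter = a @ p.t() / temperature                       # diag: positives; off-diag: inter-negatives
    intra = (a * n).sum(-1, keepdim=True) / temperature   # (B, 1) extra hard negative per sample
    logits = torch.cat([inter, intra], dim=1)             # (B, B+1)
    labels = torch.arange(a.size(0))                      # positive index = diagonal position
    return F.cross_entropy(logits, labels)
```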