Beyond Short Clips: End-to-End Video-Level Learning with Collaborative
Memories
- URL: http://arxiv.org/abs/2104.01198v1
- Date: Fri, 2 Apr 2021 18:59:09 GMT
- Title: Beyond Short Clips: End-to-End Video-Level Learning with Collaborative
Memories
- Authors: Xitong Yang, Haoqi Fan, Lorenzo Torresani, Larry Davis and Heng Wang
- Abstract summary: We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
- Score: 56.91664227337115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The standard way of training video models entails sampling at each iteration
a single clip from a video and optimizing the clip prediction with respect to
the video-level label. We argue that a single clip may not have enough temporal
coverage to exhibit the label to be recognized, since video datasets are often
weakly labeled with categorical information but without dense temporal
annotations. Furthermore, optimizing the model over brief clips impedes its
ability to learn long-term temporal dependencies. To overcome these
limitations, we introduce a collaborative memory mechanism that encodes
information across multiple sampled clips of a video at each training
iteration. This enables the learning of long-range dependencies beyond a single
clip. We explore different design choices for the collaborative memory to ease
the optimization difficulties. Our proposed framework is end-to-end trainable
and significantly improves the accuracy of video classification at a negligible
computational overhead. Through extensive experiments, we demonstrate that our
framework generalizes to different video architectures and tasks, outperforming
the state of the art on both action recognition (e.g., Kinetics-400 & 700,
Charades, Something-Something-V1) and action detection (e.g., AVA v2.1 & v2.2).
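To make the proposed setup concrete, the sketch below illustrates one training iteration in which several clips sampled from the same video share a pooled memory that is injected back into each clip representation before a single video-level loss is computed. This is a minimal sketch under stated assumptions: the backbone and classifier are generic modules, gated fusion stands in for the memory designs explored in the paper, and the names CollaborativeMemory and video_level_step are illustrative, not the authors' released implementation.

```python
# Minimal sketch of multi-clip, video-level training with a shared cross-clip
# memory. Module and function names are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollaborativeMemory(nn.Module):
    """Pools features from all clips of a video and feeds the shared context
    back into each clip via a learned gate (one of several possible designs)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, clip_feats):                  # (num_clips, B, D)
        memory = clip_feats.mean(dim=0)             # shared per-video context, (B, D)
        fused = []
        for f in clip_feats:                        # inject the memory into every clip
            g = torch.sigmoid(self.gate(torch.cat([f, memory], dim=-1)))
            fused.append(g * f + (1.0 - g) * memory)
        return torch.stack(fused, dim=0)            # (num_clips, B, D)

def video_level_step(backbone, memory, classifier, clips, labels, optimizer):
    """One iteration: multiple clips per video, a single video-level loss,
    gradients flowing end-to-end through all clips and the memory."""
    feats = torch.stack([backbone(c) for c in clips], dim=0)   # (num_clips, B, D)
    feats = memory(feats)                                      # cross-clip fusion
    logits = classifier(feats).mean(dim=0)                     # video-level prediction
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the memory itself adds only a pooling and a gating step on top of the per-clip backbone passes, so the fusion is cheap relative to the feature extraction.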
Related papers
- VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges [42.555895949250704]
VideoLLaMB is a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences.
The SceneTilling algorithm segments videos into independent semantic units to preserve semantic integrity.
In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU.
arXiv Detail & Related papers (2024-09-02T08:52:58Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, which bottlenecks the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Spatio-Temporal Crop Aggregation for Video Representation Learning [33.296154476701055]
Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone.
We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and $k$-NN probing on common action classification datasets.
arXiv Detail & Related papers (2022-11-30T14:43:35Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving temporal consistency without introducing any overhead at inference time.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances, learning visual representations by discriminating instances from each other (a minimal sketch of this clip-level contrastive objective appears after this list).
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Skimming and Scanning for Untrimmed Video Action Recognition [44.70501912319826]
Untrimmed videos have redundant and diverse clips containing contextual information.
We propose a simple yet effective clip-level solution based on skim-scan techniques.
Our solution surpasses the state-of-the-art performance in terms of both accuracy and efficiency.
arXiv Detail & Related papers (2021-04-21T12:23:44Z)
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
A canonical approach to video-and-language learning dictates that a neural model learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
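Several of the entries above (ASCNet and the semi-supervised temporal contrastive approach) build on a clip-level contrastive objective, as referenced in the ASCNet summary. Here is a minimal InfoNCE-style sketch of that objective; the function name clip_infonce and the temperature value are assumptions for illustration, not taken from any of the listed papers.

```python
# Minimal sketch of a clip-level contrastive (InfoNCE) loss: two embedded clips
# (or augmented views) per video form a positive pair, other videos in the
# batch serve as negatives. The helper name `clip_infonce` is illustrative.
import torch
import torch.nn.functional as F

def clip_infonce(z1, z2, temperature=0.1):
    """z1, z2: (B, D) clip embeddings; row i of z1 and row i of z2 come from
    the same video (positive pair), all other rows act as negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = (z1 @ z2.t()) / temperature        # (B, B) pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)     # positives lie on the diagonal
```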