Towards Long-Form Video Understanding
- URL: http://arxiv.org/abs/2106.11310v1
- Date: Mon, 21 Jun 2021 17:59:52 GMT
- Title: Towards Long-Form Video Understanding
- Authors: Chao-Yuan Wu, Philipp Krähenbühl
- Abstract summary: We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets.
A novel object-centric transformer-based video recognition architecture performs significantly better on 7 diverse tasks.
- Score: 7.962725903399016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our world offers a never-ending stream of visual stimuli, yet today's vision
systems only accurately recognize patterns within a few seconds. These systems
understand the present, but fail to contextualize it in past or future events.
In this paper, we study long-form video understanding. We introduce a framework
for modeling long-form videos and develop evaluation protocols on large-scale
datasets. We show that existing state-of-the-art short-term models are limited
for long-form tasks. A novel object-centric transformer-based video recognition
architecture performs significantly better on 7 diverse tasks. It also
outperforms comparable state-of-the-art on the AVA dataset.
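The abstract names an object-centric transformer but does not describe its internals, so the following is an illustrative sketch only: per-object features (e.g., from an off-the-shelf detector run on each clip) are projected to tokens, tagged with the index of the clip they were seen in, and passed through a standard transformer encoder so objects can attend to each other across the whole video. Class names, dimensions, and the use of `nn.TransformerEncoder` are assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of an object-centric transformer over a long video.
# All shapes, names, and hyperparameters are placeholders.
import torch
import torch.nn as nn

class LongFormObjectTransformer(nn.Module):
    """Classify a long video from per-object features detected in each clip."""
    def __init__(self, obj_dim=256, model_dim=512, num_heads=8, num_layers=4,
                 num_classes=10, max_clips=512):
        super().__init__()
        self.proj = nn.Linear(obj_dim, model_dim)            # object feature -> token
        self.time_emb = nn.Embedding(max_clips, model_dim)   # which clip the object came from
        layer = nn.TransformerEncoderLayer(model_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.cls = nn.Linear(model_dim, num_classes)

    def forward(self, obj_feats, clip_ids):
        # obj_feats: (batch, num_objects, obj_dim) -- detections gathered over the whole video
        # clip_ids:  (batch, num_objects)          -- clip index for each detection
        tokens = self.proj(obj_feats) + self.time_emb(clip_ids)
        tokens = self.encoder(tokens)                # objects attend across the full video
        return self.cls(tokens.mean(dim=1))          # pooled video-level prediction

# Example usage with random features standing in for detector outputs.
model = LongFormObjectTransformer()
feats = torch.randn(2, 64, 256)                      # 64 object detections per video
clips = torch.randint(0, 512, (2, 64))
logits = model(feats, clips)                         # (2, 10)
```

The sketch only conveys the shift from clip-level tokens to object-level tokens spanning the full video; the paper's actual token construction, attention pattern, and task heads are not specified in the abstract.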
Related papers
- From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding [52.696422425058245]
MultiModal Large Language Models (MLLMs) with visual encoders have recently shown promising performance in visual understanding tasks.
Our paper focuses on the substantial differences and unique challenges posed by long video understanding compared to static image and short video understanding.
arXiv Detail & Related papers (2024-09-27T17:38:36Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- MUSTAN: Multi-scale Temporal Context as Attention for Robust Video Foreground Segmentation [2.2232550112727267]
Video foreground segmentation (VFS) is an important computer vision task wherein one aims to segment the objects under motion from the background.
Most of the current methods are image-based, i.e., rely only on spatial cues while ignoring motion cues.
In this paper, we utilize temporal information together with spatial cues from the video data to improve out-of-distribution (OOD) performance.
arXiv Detail & Related papers (2024-02-01T13:47:23Z)
- TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking [33.75267864844047]
Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings.
We propose a novel, clip-based DETR-style encoder-decoder architecture that focuses on systematically analyzing and addressing the aforementioned challenges.
Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations.
arXiv Detail & Related papers (2023-12-13T21:02:03Z)
- Video-based Person Re-identification with Long Short-Term Representation Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapped cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z)
- Just a Glimpse: Rethinking Temporal Information for Video Continual Learning [58.7097258722291]
We propose a novel replay mechanism for effective video continual learning based on individual/single frames.
Under extreme memory constraints, video diversity plays a more significant role than temporal information.
Our method achieves state-of-the-art performance, outperforming the previous state-of-the-art by up to 21.49%.
arXiv Detail & Related papers (2023-05-28T19:14:25Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses Text-to-Video expertise to capture temporal saliency in a parameter-free manner (see the illustrative sketch below).
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
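The summary only names the mechanism, so the following is a hypothetical sketch of what parameter-free, text-guided temporal saliency could look like: per-frame visual embeddings are compared with a category text embedding, and the resulting similarities weight a temporal pooling step. Function names, shapes, and the temperature value are assumptions, not BIKE's published formulation.

```python
# Hypothetical sketch of parameter-free temporal saliency:
# weight frames by their similarity to a text (category) embedding, then pool.
import torch
import torch.nn.functional as F

def text_guided_temporal_pool(frame_feats, text_feat, temperature=0.07):
    # frame_feats: (num_frames, dim) per-frame visual embeddings (e.g. from CLIP)
    # text_feat:   (dim,)            embedding of the category name
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    saliency = frame_feats @ text_feat                  # cosine similarity per frame
    weights = F.softmax(saliency / temperature, dim=0)  # no learned parameters
    return weights @ frame_feats                        # saliency-weighted video feature

# Example usage with random embeddings standing in for encoder outputs.
video_feat = text_guided_temporal_pool(torch.randn(16, 512), torch.randn(512))
```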
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks and achieve new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z)
- Multiview Transformers for Video Recognition [69.50552269271526]
We present Multiview Transformers for Video recognition (MTV), which model separate views of the input video at different resolutions.
MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost.
We achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining.
arXiv Detail & Related papers (2022-01-12T03:33:57Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
- A Comprehensive Study of Deep Video Action Recognition [35.7068977497202]
Video action recognition is one of the representative tasks for video understanding.
We provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition.
arXiv Detail & Related papers (2020-12-11T18:54:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.