Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for
Long-form Video Understanding
- URL: http://arxiv.org/abs/2309.11569v1
- Date: Wed, 20 Sep 2023 18:13:32 GMT
- Title: Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for
Long-form Video Understanding
- Authors: Mohamed Afham, Satya Narayan Shukla, Omid Poursaeed, Pengchuan Zhang,
Ashish Shah, Sernam Lim
- Abstract summary: Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length.
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
- Score: 57.917616284917756
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While most modern video understanding models operate on short-range clips,
real-world videos are often several minutes long with semantically consistent
segments of variable length. A common approach to process long videos is
applying a short-form video model over uniformly sampled clips of fixed
temporal length and aggregating the outputs. This approach neglects the
underlying nature of long videos since fixed-length clips are often redundant
or uninformative. In this paper, we aim to provide a generic and adaptive
sampling approach for long-form videos in lieu of the de facto uniform
sampling. Viewing videos as semantically consistent segments, we formulate a
task-agnostic, unsupervised, and scalable approach based on Kernel Temporal
Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our
method on long-form video understanding tasks such as video classification and
temporal action localization, showing consistent gains over existing approaches
and achieving state-of-the-art performance on long-form video modeling.
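As background for the approach the abstract describes: Kernel Temporal Segmentation detects change points by choosing segment boundaries that minimize the within-segment scatter of frame features, computed from a kernel (Gram) matrix and optimized with dynamic programming. Below is a minimal, illustrative sketch under assumed choices (a linear kernel, a simple log-penalty for selecting the number of segments, and the hypothetical function name kts_change_points); it is a sketch of the general KTS idea, not the authors' released implementation.

```python
import numpy as np

def kts_change_points(features, max_segments=20, penalty=1.0):
    """Sketch of kernel temporal segmentation (KTS) change-point detection.

    features: (n_frames, dim) array of per-frame descriptors.
    Returns the frame indices where a new segment starts (excluding 0).
    """
    n = len(features)
    # Kernel (Gram) matrix; a linear kernel is assumed for simplicity.
    K = features @ features.T
    diag = np.diag(K)

    # Prefix sums so each segment cost can be queried in O(1).
    cum_diag = np.concatenate([[0.0], np.cumsum(diag)])
    cum_K = np.zeros((n + 1, n + 1))
    cum_K[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)

    def seg_cost(a, b):
        # Within-segment scatter of frames [a, b) in feature space.
        length = b - a
        block = cum_K[b, b] - cum_K[a, b] - cum_K[b, a] + cum_K[a, a]
        return (cum_diag[b] - cum_diag[a]) - block / length

    # Dynamic program: cost[m, t] = best cost of splitting frames [0, t)
    # into m segments; back[m, t] = start index of the last segment.
    INF = float("inf")
    cost = np.full((max_segments + 1, n + 1), INF)
    back = np.zeros((max_segments + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for m in range(1, max_segments + 1):
        for t in range(m, n + 1):
            for s in range(m - 1, t):
                c = cost[m - 1, s] + seg_cost(s, t)
                if c < cost[m, t]:
                    cost[m, t] = c
                    back[m, t] = s

    # Model selection: penalise the number of segments (assumed penalty form).
    scores = [cost[m, n] + penalty * m * (np.log(n / m) + 1)
              for m in range(1, max_segments + 1)]
    best_m = int(np.argmin(scores)) + 1

    # Backtrack the segment start indices.
    boundaries, t = [], n
    for m in range(best_m, 0, -1):
        t = back[m, t]
        boundaries.append(t)
    return sorted(boundaries)[1:]  # drop the leading 0
```

In an adaptive-sampling setting like the one motivated above, one could then draw a fixed budget of frames or clips from each detected segment rather than sampling uniformly, so that long redundant stretches do not dominate the token budget; this usage pattern is an illustration, not a description of the paper's exact pipeline.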
Related papers
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content.
We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context.
Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
- FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention [57.651429116402554]
This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model for consistent long video generation.
We find that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation.
Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process.
arXiv Detail & Related papers (2024-07-29T11:52:07Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
- Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z)
- Temporally stable video segmentation without video annotations [6.184270985214255]
We introduce a method to adapt still image segmentation models to video in an unsupervised manner.
We verify that the consistency measure is well correlated with human judgement via a user study.
We observe improvements in the generated segmented videos with minimal loss of accuracy.
arXiv Detail & Related papers (2021-10-17T18:59:11Z)
- A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.