Related papers: Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling

Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling

URL: http://arxiv.org/abs/2410.14993v1
Date: Sat, 19 Oct 2024 05:50:00 GMT
Title: Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling
Authors: Hao Wu, Donglin Bai, Shiqi Jiang, Qianxi Zhang, Yifan Yang, Ting Cao, Fengyuan Xu,
Abstract summary: Video understanding has become increasingly important with the rise of multi-modality applications. We introduce a novel system, C-VUE, to overcome these issues through adaptive state modeling. C-VUE has three key designs. The first is a long-range history modeling technique that uses a video-aware approach to retain historical video information. The second is a spatial redundancy reduction technique, which enhances the efficiency of history modeling based on temporal relations.
Score: 14.450847211200292
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video understanding has become increasingly important with the rise of multi-modality applications. Understanding continuous video poses considerable challenges due to the fast expansion of streaming video, which contains multi-scale and untrimmed events. We introduce a novel system, C-VUE, to overcome these issues through adaptive state modeling. C-VUE has three key designs. The first is a long-range history modeling technique that uses a video-aware approach to retain historical video information. The second is a spatial redundancy reduction technique, which enhances the efficiency of history modeling based on temporal relations. The third is a parallel training structure that incorporates the frame-weighted loss to understand multi-scale events in long videos. Our C-VUE offers high accuracy and efficiency. It runs at speeds >30 FPS on typical edge devices and outperforms all baselines in accuracy. Moreover, applying C-VUE to a video foundation model as a video encoder in our case study resulted in a 0.46-point enhancement (on a 5-point scale) on the in-distribution dataset, and an improvement ranging from 1.19\% to 4\% on zero-shot datasets.

Related papers

Enhancing Video Memorability Prediction with Text-Motion Cross-modal Contrastive Loss and Its Application in Video Summarization [3.145491389580349]
We introduce the Text-Motion Cross-modal Contrastive Loss (TMCCL), a multimodal video memorability prediction model.<n>We tackle the challenge of improving motion feature representation by leveraging text description similarities across videos.<n>Our model achieves state-of-the-art performance on two video memorability prediction datasets.
arXiv Detail & Related papers (2025-06-10T10:01:08Z)
Depth Any Video with Scalable Synthetic Data [98.42356740981839]
We develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse synthetic environments. We leverage the powerful priors of generative video diffusion models to handle real-world videos effectively. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.
arXiv Detail & Related papers (2024-10-14T17:59:46Z)
MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing. First, we introduce the 3D Inverted Vector-Quantization Variencoenco Autocoder. Second, we present MotionAura, a text-to-video generation framework. Third, we propose a spectral transformer-based denoising network. Fourth, we introduce a downstream task of Sketch Guided Videopainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z)
Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos [13.368981834953981]
We propose Fr'echet Video Motion Distance metric, which focuses on evaluating motion consistency in video generation. Specifically, we design explicit motion features based on key point tracking, and then measure the similarity between these features via the Fr'echet distance. We carry out a large-scale human study, demonstrating that our metric effectively detects temporal noise and aligns better with human perceptions of generated video quality than existing metrics.
arXiv Detail & Related papers (2024-07-23T02:10:50Z)
AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction. Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task. We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
GTAutoAct: An Automatic Datasets Generation Framework Based on Game Engine Redevelopment for Action Recognition [12.521014978532548]
GTAutoAct is a novel dataset generation framework leveraging game engine technology to facilitate advancements in action recognition. It transforms coordinate-based 3D human motion into rotation-orientated representation with enhanced suitability in multiple viewpoints. It implements an autonomous video capture and processing pipeline, featuring a randomly navigating camera, with auto-trimming and labeling functionalities.
arXiv Detail & Related papers (2024-01-24T12:18:31Z)
TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking [33.75267864844047]
Video Object (VOS) has emerged as an increasingly important problem with availability of larger datasets and more complex and realistic settings. We propose a novel, clip-based DETR-style encoder-decoder architecture, which focuses on systematically analyzing and addressing aforementioned challenges. Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations.
arXiv Detail & Related papers (2023-12-13T21:02:03Z)
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image--text models to video via shallow temporal fusion. We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video--language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models. Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference. We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z)
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local global contexts. Video-FocalNet is based on a-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention. We show that Video-FocalNets perform favorably well against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge. We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
Dynamic Appearance: A Video Representation for Action Recognition with Joint Training [11.746833714322154]
We introduce a new concept, Dynamic Appearance (DA), summarizing the appearance information relating to movement in a video. We consider distilling the dynamic appearance from raw video data as a means of efficient video understanding. We provide extensive experimental results on four action recognition benchmarks.
arXiv Detail & Related papers (2022-11-23T07:16:16Z)
Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling [125.95527079960725]
Transformer-based models have achieved top performance on major video recognition benchmarks. Video Mobile-Former is the first Transformer-based video model which constrains the computational budget within 1G FLOPs.
arXiv Detail & Related papers (2022-08-25T17:59:00Z)
Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts. Existing approaches usually align and aggregate video frames from limited adjacent frames. We propose a novel Transformer for Video Super-Resolution (TTVSR)
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
Argus++: Robust Real-time Activity Detection for Unconstrained Video Streams with Overlapping Cube Proposals [85.76513755331318]
Argus++ is a robust real-time activity detection system for analyzing unconstrained video streams. The overall system is optimized for real-time processing on standalone consumer-level hardware.
arXiv Detail & Related papers (2022-01-14T03:35:22Z)
EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content. First, when extracting local cues, we generate the spatial-temporal kernels of dynamic-scale to adaptively fit the diverse events. Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only. It is difficult to construct a suitable self-supervised task to well model both motion and appearance features. We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention [84.83632045374155]
Attentive video modeling is essential for action recognition in unconstrained videos. What-Where-When (W3) video attention module models all three facets of video attention jointly. Experiments show that our attention model brings significant improvements to existing action recognition models.
arXiv Detail & Related papers (2020-04-02T21:48:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.