Related papers: Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization

Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization

URL: http://arxiv.org/abs/2103.15233v2
Date: Tue, 30 Mar 2021 13:21:37 GMT
Title: Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization
Authors: Mengmeng Xu, Juan-Manuel Perez-Rua, Xiatian Zhu, Bernard Ghanem, Brais Martinez
Abstract summary: TAL is a fundamental yet challenging task in video understanding. Existing TAL methods rely on pre-training a video encoder through action classification supervision. We introduce a novel low-fidelity end-to-end (LoFi) video encoder pre-training method.
Score: 96.73647162960842
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Temporal action localization (TAL) is a fundamental yet challenging task in video understanding. Existing TAL methods rely on pre-training a video encoder through action classification supervision. This results in a task discrepancy problem for the video encoder -- trained for action classification, but used for TAL. Intuitively, end-to-end model optimization is a good solution. However, this is not operable for TAL subject to the GPU memory constraints, due to the prohibitive computational cost in processing long untrimmed videos. In this paper, we resolve this challenge by introducing a novel low-fidelity end-to-end (LoFi) video encoder pre-training method. Instead of always using the full training configurations for TAL learning, we propose to reduce the mini-batch composition in terms of temporal, spatial or spatio-temporal resolution so that end-to-end optimization for the video encoder becomes operable under the memory conditions of a mid-range hardware budget. Crucially, this enables the gradient to flow backward through the video encoder from a TAL loss supervision, favourably solving the task discrepancy problem and providing more effective feature representations. Extensive experiments show that the proposed LoFi pre-training approach can significantly enhance the performance of existing TAL methods. Encouragingly, even with a lightweight ResNet18 based video encoder in a single RGB stream, our method surpasses two-stream ResNet50 based alternatives with expensive optical flow, often by a good margin.

Related papers

Breaking the Encoder Barrier for Seamless Video-Language Understanding [22.749949819082484]
We propose ELVA, an encoder-free-LLM that directly models nuanced video-language interactions without relying on a vision encoder. With only 7M publicly available video-text pairs, ELVA achieves performance on par with encoder-based Video-LLMs while reducing FLOPs by up to 95% and inference latency by 92%.
arXiv Detail & Related papers (2025-03-24T08:06:39Z)
Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
New tokenizer, Diffusion Conditioned-based Gene Tokenizer, replaces GAN-based decoder with conditional diffusion model. We trained using only a basic MSE diffusion loss for reconstruction, along with KL term and LPIPS perceptual loss from scratch. Even a scaled-down version of CDT (3$times inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
RL-RC-DoT: A Block-level RL agent for Task-Aware Video Compression [68.31184784672227]
In modern applications such as autonomous driving, an overwhelming majority of videos serve as input for AI systems performing tasks. It is therefore useful to optimize the encoder for a downstream task instead of for image quality. Here, we address this challenge by controlling the Quantization Parameters (QPs) at the macro-block level to optimize the downstream task.
arXiv Detail & Related papers (2025-01-21T15:36:08Z)
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models [26.866184981409607]
We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders. Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:56Z)
SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity [15.872209884833977]
We propose a memory-efficient scheduling method to eliminate memory overhead and an online adjustment mechanism to minimize accuracy degradation. SparseTem achieves speedup of 1.79x for EfficientDet and 4.72x for CRNN, with minimal accuracy drop and no additional memory overhead.
arXiv Detail & Related papers (2024-10-28T07:13:25Z)
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image--text models to video via shallow temporal fusion. We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video--language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
Accelerating Learnt Video Codecs with Gradient Decay and Layer-wise Distillation [17.980800481385195]
We present a novel model-agnostic pruning scheme based on gradient decay and adaptive layer-wise distillation. Results confirm that our method yields up to 65% reduction in MACs and 2x speed-up with less than 0.3dB drop in BD-PSNR.
arXiv Detail & Related papers (2023-12-05T09:26:09Z)
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision. We propose a tunable-free pre-training strategy to retain the generalization ability of the text encoder. We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
Efficient Meta-Tuning for Content-aware Neural Video Delivery [40.3731358963689]
We present Efficient Meta-Tuning (EMT) to reduce the computational cost. EMT adapts a meta-learned model to the first chunk of the input video. We propose a novel sampling strategy to extract the most challenging patches from video frames.
arXiv Detail & Related papers (2022-07-20T06:47:10Z)
Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts. Existing approaches usually align and aggregate video frames from limited adjacent frames. We propose a novel Transformer for Video Super-Resolution (TTVSR)
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs) The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved. We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z)
Self-Conditioned Probabilistic Learning of Video Rescaling [70.10092286301997]
We propose a self-conditioned probabilistic framework for video rescaling to learn the paired downscaling and upscaling procedures simultaneously. We decrease the entropy of the information lost in the downscaling by maximizing its conditioned probability on the strong spatial-temporal prior information. We extend the framework to a lossy video compression system, in which a gradient estimator for non-differential industrial lossy codecs is proposed.
arXiv Detail & Related papers (2021-07-24T15:57:15Z)
Ultra-low bitrate video conferencing using deep image animation [7.263312285502382]
We propose a novel deep learning approach for ultra-low video compression for video conferencing applications. We employ deep neural networks to encode motion information as keypoint displacement and reconstruct the video signal at the decoder side.
arXiv Detail & Related papers (2020-12-01T09:06:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.