VA-RED$^2$: Video Adaptive Redundancy Reduction
- URL: http://arxiv.org/abs/2102.07887v1
- Date: Mon, 15 Feb 2021 22:57:52 GMT
- Title: VA-RED$^2$: Video Adaptive Redundancy Reduction
- Authors: Bowen Pan, Rameswar Panda, Camilo Fosco, Chung-Ching Lin, Alex
Andonian, Yue Meng, Kate Saenko, Aude Oliva, Rogerio Feris
- Abstract summary: We present a redundancy reduction framework, VA-RED$^2$, which is input-dependent.
We learn the adaptive policy jointly with the network weights in a differentiable way with a shared-weight mechanism.
Our framework achieves $20\% - 40\%$ reduction in computation (FLOPs) when compared to state-of-the-art methods.
- Score: 64.75692128294175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Performing inference on deep learning models for videos remains a challenge
due to the large amount of computational resources required to achieve robust
recognition. An inherent property of real-world videos is the high correlation
of information across frames which can translate into redundancy in either
temporal or spatial feature maps of the models, or both. The type of redundant
features depends on the dynamics and type of events in the video: static videos
have more temporal redundancy while videos focusing on objects tend to have
more channel redundancy. Here we present a redundancy reduction framework,
termed VA-RED$^2$, which is input-dependent. Specifically, our VA-RED$^2$
framework uses an input-dependent policy to decide how many features need to be
computed for temporal and channel dimensions. To keep the capacity of the
original model, after fully computing the necessary features, we reconstruct
the remaining redundant features from those using cheap linear operations. We
learn the adaptive policy jointly with the network weights in a differentiable
way with a shared-weight mechanism, making it highly efficient. Extensive
experiments on multiple video datasets and different visual tasks show that our
framework achieves $20\% - 40\%$ reduction in computation (FLOPs) when compared
to state-of-the-art methods without any performance loss. Project page:
http://people.csail.mit.edu/bpan/va-red/.
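As a rough illustration of the idea described above, the sketch below computes only a fraction of a layer's channel features with a full 3D convolution and reconstructs the rest with a cheap $1\times1\times1$ linear operation. The module name, the fixed ratio, and the tensor shapes are illustrative assumptions; in VA-RED$^2$ the ratio is chosen per input by the learned policy rather than fixed.

```python
# Minimal sketch (not the authors' code) of channel-wise redundancy reduction:
# compute a fraction of the output channels with a full 3D convolution and
# synthesize the remaining "redundant" channels with a cheap 1x1x1 linear op.
# In VA-RED^2 the ratio is chosen per input by a learned policy; it is fixed
# here for clarity.
import torch
import torch.nn as nn

class ChannelReducedBlock(nn.Module):
    def __init__(self, in_ch, out_ch, ratio=0.5):
        super().__init__()
        self.kept = max(1, int(out_ch * ratio))   # channels computed in full
        cheap = out_ch - self.kept                # channels reconstructed cheaply
        self.full_conv = nn.Conv3d(in_ch, self.kept, kernel_size=3, padding=1)
        self.cheap_conv = nn.Conv3d(self.kept, cheap, kernel_size=1)

    def forward(self, x):                         # x: (B, C, T, H, W)
        dense = self.full_conv(x)                 # expensive features
        recon = self.cheap_conv(dense)            # cheaply reconstructed features
        return torch.cat([dense, recon], dim=1)

clip = torch.randn(2, 16, 8, 32, 32)              # batch of short clips
out = ChannelReducedBlock(16, 64, ratio=0.25)(clip)
print(out.shape)                                  # torch.Size([2, 64, 8, 32, 32])
```

The same pattern applies along the temporal dimension: a subset of frames is computed densely and the remaining temporal feature maps are reconstructed with similarly cheap operations.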
Related papers
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
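A minimal sketch of the similarity-based frame removal step described above, assuming `frame_feats` holds per-frame DINOv2 embeddings; the cosine-similarity threshold is an arbitrary illustrative choice, not LongVU's setting:

```python
# Illustrative sketch (not LongVU's code): drop frames whose embedding is
# nearly identical to the last kept frame. `frame_feats` stands in for
# per-frame DINOv2 embeddings; the 0.95 threshold is an arbitrary choice.
import torch
import torch.nn.functional as F

def prune_redundant_frames(frame_feats, threshold=0.95):
    keep = [0]                                        # always keep the first frame
    for i in range(1, frame_feats.size(0)):
        sim = F.cosine_similarity(frame_feats[i], frame_feats[keep[-1]], dim=0)
        if sim < threshold:                           # keep only sufficiently novel frames
            keep.append(i)
    return keep

feats = torch.randn(32, 768)                          # 32 frames, 768-dim embeddings
print(prune_redundant_frames(feats))
```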
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
- Data Overfitting for On-Device Super-Resolution with Dynamic Algorithm and Compiler Co-Design [18.57172631588624]
We propose a Dynamic Deep neural network assisted by a Content-Aware data processing pipeline to reduce the number of models down to one.
Our method achieves better PSNR and real-time performance (33 FPS) on an off-the-shelf mobile phone.
arXiv Detail & Related papers (2024-07-03T05:17:26Z)
- $R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding [41.69321731689751]
Video temporal grounding aims to ground relevant clips in untrimmed videos given natural language queries.
Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones.
We propose Reversed Recurrent Tuning ($R^2$-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding.
arXiv Detail & Related papers (2024-03-31T21:17:48Z)
- SViTT: Temporal Learning of Sparse Video-Text Transformers [65.93031164906812]
We propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention.
SViTT employs two forms of sparsity: edge sparsity that limits the query-key communications between tokens in self-attention, and node sparsity that discards uninformative visual tokens.
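A toy sketch of the second (node) form of sparsity, pruning visual tokens before attention; the norm-based saliency proxy and keep ratio are illustrative assumptions, not SViTT's actual criterion:

```python
# Toy sketch of token ("node") pruning, not SViTT's actual criterion: keep the
# top-k visual tokens ranked by a norm-based saliency proxy before attention.
import torch

def prune_tokens(tokens, keep_ratio=0.5):
    # tokens: (B, N, D)
    k = max(1, int(tokens.size(1) * keep_ratio))
    scores = tokens.norm(dim=-1)                              # (B, N) saliency proxy
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values    # keep original token order
    batch = torch.arange(tokens.size(0)).unsqueeze(1)
    return tokens[batch, idx]                                 # (B, k, D)

x = torch.randn(2, 196, 64)
print(prune_tokens(x, 0.25).shape)                            # torch.Size([2, 49, 64])
```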
arXiv Detail & Related papers (2023-04-18T08:17:58Z)
- Towards Scalable Neural Representation for Diverse Videos [68.73612099741956]
Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images.
Existing INR-based methods are limited to encoding a handful of short videos with redundant visual content.
This paper focuses on developing neural representations for encoding long and/or a large number of videos with diverse visual content.
arXiv Detail & Related papers (2023-03-24T16:32:19Z)
- A Codec Information Assisted Framework for Efficient Compressed Video Super-Resolution [15.690562510147766]
Video Super-Resolution (VSR) using recurrent neural network architecture is a promising solution due to its efficient modeling of long-range temporal dependencies.
We propose a Codec Information Assisted Framework (CIAF) to boost and accelerate recurrent VSR models for compressed videos.
arXiv Detail & Related papers (2022-10-15T08:48:29Z)
- Scalable Neural Video Representations with Learnable Positional Features [73.51591757726493]
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior arts, NVP not only trains 2 times faster (less than 5 minutes) but also exceeds their encoding quality ($34.07 \rightarrow 34.57$, measured with the PSNR metric).
arXiv Detail & Related papers (2022-10-13T08:15:08Z)
- Recurrent Video Restoration Transformer with Guided Deformable Attention [116.1684355529431]
We propose RVRT, which processes local neighboring frames in parallel within a globally recurrent framework.
RVRT achieves state-of-the-art performance on benchmark datasets with balanced model size, testing memory and runtime.
arXiv Detail & Related papers (2022-06-05T10:36:09Z)
- TAda! Temporally-Adaptive Convolutions for Video Understanding [17.24510667917993]
Adaptive weight calibration along the temporal dimension is an efficient way to facilitate modelling complex temporal dynamics in videos.
TAdaConv empowers the spatial convolutions with temporal modelling abilities by calibrating the convolution weights for each frame according to its local and global temporal context.
We construct TAda2D networks by replacing spatial convolutions in ResNet with TAdaConv, which leads to on par or better performance compared to state-of-the-art approaches on multiple video action recognition and localization benchmarks.
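A simplified sketch of per-frame weight calibration in this spirit; the single pooled-context head is a stand-in for the paper's local/global temporal context branch, and the calibration is averaged over the batch so one kernel can be passed to a standard convolution call:

```python
# Rough sketch (not the official TAdaConv): a shared 2D kernel is calibrated
# per frame by a factor predicted from that frame's pooled features. The single
# linear head is a stand-in for the paper's local/global temporal context
# branch, and the calibration is averaged over the batch for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporallyAdaptiveConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.calib = nn.Linear(in_ch, out_ch)     # predicts per-frame channel scales

    def forward(self, x):                         # x: (B, C, T, H, W)
        outs = []
        for t in range(x.size(2)):
            frame = x[:, :, t]                    # (B, C, H, W)
            ctx = frame.mean(dim=(2, 3))          # global spatial context, (B, C)
            scale = torch.sigmoid(self.calib(ctx)).mean(0)   # (out_ch,)
            w_t = self.weight * scale.view(-1, 1, 1, 1)      # calibrated kernel
            outs.append(F.conv2d(frame, w_t, padding=1))
        return torch.stack(outs, dim=2)           # (B, out_ch, T, H, W)

y = TemporallyAdaptiveConv(16, 32)(torch.randn(2, 16, 8, 14, 14))
print(y.shape)                                    # torch.Size([2, 32, 8, 14, 14])
```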
arXiv Detail & Related papers (2021-10-12T17:25:07Z)
- Skip-Convolutions for Efficient Video Processing [21.823332885657784]
Skip-Convolutions leverage the large amount of redundancy in video streams to save computation.
We replace all convolutions with Skip-Convolutions in two state-of-the-art architectures, namely EfficientDet and HRNet.
We reduce their computational cost consistently by a factor of 3-4x for two different tasks, without any accuracy drop.
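Because convolution is linear, $\mathrm{conv}(x_t) = \mathrm{conv}(x_{t-1}) + \mathrm{conv}(x_t - x_{t-1})$, so per-frame work can be restricted to the residual and skipped where frames barely change. A hedged sketch of that idea follows; the gating threshold is an illustrative assumption, and real FLOP savings additionally require sparse execution of the gated regions:

```python
# Hedged sketch of the skip idea, not the paper's implementation. The gating
# threshold is an illustrative assumption; dense masking only illustrates the
# math, and real FLOP savings require sparse execution of the gated regions.
import torch
import torch.nn.functional as F

def skip_conv_video(frames, weight, threshold=1e-2):
    # frames: (T, C, H, W); weight: (C_out, C, k, k), no bias
    outputs = [F.conv2d(frames[0:1], weight, padding=1)]  # first frame in full
    prev_in, prev_out = frames[0:1], outputs[0]
    for t in range(1, frames.size(0)):
        residual = frames[t:t + 1] - prev_in
        gate = (residual.abs() > threshold).float()       # skip near-unchanged pixels
        delta = F.conv2d(residual * gate, weight, padding=1)
        out = prev_out + delta                            # conv(x_t) ~ conv(x_{t-1}) + conv(r_t)
        outputs.append(out)
        prev_in, prev_out = frames[t:t + 1], out
    return torch.cat(outputs, dim=0)

video = torch.randn(8, 3, 64, 64)
kernel = torch.randn(16, 3, 3, 3) * 0.1
print(skip_conv_video(video, kernel).shape)               # torch.Size([8, 16, 64, 64])
```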
arXiv Detail & Related papers (2021-04-23T09:10:39Z)