Efficient Convolution and Transformer-Based Network for Video Frame Interpolation
- URL: http://arxiv.org/abs/2307.06443v1
- Date: Wed, 12 Jul 2023 20:14:06 GMT
- Title: Efficient Convolution and Transformer-Based Network for Video Frame Interpolation
- Authors: Issa Khalifeh, Luka Murn, Marta Mrak and Ebroul Izquierdo
- Abstract summary: A novel method integrating a transformer encoder and convolutional features is proposed.
This network reduces the memory burden by close to 50% and runs up to four times faster at inference.
A dual-encoder architecture is introduced which combines the strengths of convolutions in modelling local correlations with those of transformers in modelling long-range dependencies.
- Score: 11.036815066639473
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video frame interpolation is an increasingly important research task with
several key industrial applications in the video coding, broadcast and
production sectors. Recently, transformers have been introduced to the field
resulting in substantial performance gains. However, this comes at a cost of
greatly increased memory usage, training and inference time. In this paper, a
novel method integrating a transformer encoder and convolutional features is
proposed. This network reduces the memory burden by close to 50% and runs up to
four times faster at inference compared to existing transformer-based
interpolation methods. A dual-encoder architecture is introduced which combines
the strengths of convolutions in modelling local correlations with those of
transformers in modelling long-range dependencies. Quantitative evaluations are conducted
on various benchmarks with complex motion to showcase the robustness of the
proposed method, achieving competitive performance compared to state-of-the-art
interpolation networks.
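The paper's code is not reproduced here, but a minimal PyTorch sketch of such a dual-encoder block, a convolutional branch for local correlations fused with a transformer-encoder branch for long-range dependencies, might look as follows (module names and dimensions are illustrative assumptions, not the authors' implementation):

```python
# Illustrative dual-encoder block: convolutional branch for local detail,
# transformer branch for global context, fused by a 1x1 convolution.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, channels=64, num_heads=4, num_layers=2):
        super().__init__()
        # Convolutional branch: models local correlations.
        self.conv_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Transformer branch: models long-range dependencies over tokens.
        layer = nn.TransformerEncoderLayer(channels, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):                          # x: (B, C, H, W)
        local = self.conv_branch(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C)
        glob = self.transformer(tokens).transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local, glob], dim=1))

print(DualEncoder()(torch.randn(1, 64, 32, 32)).shape)  # (1, 64, 32, 32)
```

In an interpolation network, blocks of this kind would typically process features extracted from the two input frames before a synthesis decoder produces the intermediate frame.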
Related papers
- Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks [53.550782959908524]
We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks.
Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency.
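A rough sketch of that encode-once, decode-in-parallel pattern, using hypothetical shapes and off-the-shelf PyTorch modules rather than the actual PiD configuration:

```python
# Encode the shared input once; decode several sub-task prompts as one
# batch against the cached encoder memory. Shapes are illustrative.
import torch
import torch.nn as nn

d_model, nhead = 64, 4
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), 2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), 2)

src = torch.randn(1, 128, d_model)        # shared input, encoded once
memory = encoder(src)

prompts = torch.randn(5, 16, d_model)     # 5 decomposed sub-task prompts
memory = memory.expand(5, -1, -1)         # reuse the same memory for all
out = decoder(prompts, memory)            # decoded in parallel as one batch
print(out.shape)                          # torch.Size([5, 16, 64])
```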
arXiv Detail & Related papers (2024-03-19T19:27:23Z)
- Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding [12.595019348741042]
We propose a transformer-based video saliency prediction approach with high temporal dimension network decoding (THTDNet).
This architecture yields comparable performance to multi-branch and over-complicated models on common benchmarks such as DHF1K, UCF-sports and Hollywood-2.
arXiv Detail & Related papers (2024-01-15T20:09:56Z)
- AICT: An Adaptive Image Compression Transformer [18.05997169440533]
We propose a more straightforward yet effective Transformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT).
The proposed ICT can capture both global and local contexts from the latent representations.
We leverage a learnable scaling module with a sandwich ConvNeXt-based pre/post-processor to accurately extract a more compact latent representation.
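A rough sketch of a learnable scaling module sandwiched between a pre- and post-processor, with plain convolutions standing in for the ConvNeXt blocks (an illustrative assumption, not the AICT design):

```python
# Hypothetical sandwich module: pre-processor -> learnable per-channel
# scaling of the latent -> post-processor.
import torch
import torch.nn as nn

class SandwichScaler(nn.Module):
    def __init__(self, channels=192):
        super().__init__()
        self.pre = nn.Conv2d(channels, channels, 3, padding=1)
        self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))  # learnable
        self.post = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, latent):
        return self.post(self.pre(latent) * self.scale)

print(SandwichScaler()(torch.randn(1, 192, 16, 16)).shape)  # (1, 192, 16, 16)
```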
arXiv Detail & Related papers (2023-07-12T11:32:02Z)
- AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation [80.33846577924363]
We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video frame interpolation.
It is based on two essential designs. First, we build bidirectional correlation volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations.
Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately.
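Backward warping an input frame with a predicted flow field is commonly implemented with grid_sample; a self-contained PyTorch sketch of that final step (illustrative, not the AMT authors' code):

```python
# Backward warp: sample each output pixel from the input frame at the
# location displaced by the flow, via bilinear interpolation.
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixels, channels (dx, dy)."""
    b, _, h, w = frame.shape
    # Base sampling grid of pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=frame.dtype),
        torch.arange(w, dtype=frame.dtype), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow   # (B, 2, H, W)
    # Normalise to [-1, 1] as grid_sample expects.
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)              # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

frame = torch.randn(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)      # zero flow returns the frame unchanged
print(torch.allclose(backward_warp(frame, flow), frame, atol=1e-5))  # True
```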
arXiv Detail & Related papers (2023-04-19T16:18:47Z)
- End-to-end Transformer for Compressed Video Quality Enhancement [21.967066471073462]
We propose a transformer-based compressed video quality enhancement (TVQE) method, consisting of Swin-AutoEncoder based Spatio-Temporal feature Fusion (SSTF) module and Channel-wise Attention based Quality Enhancement (CAQE) module.
Our proposed method outperforms existing ones in terms of both inference speed and GPU consumption.
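For reference, channel-wise attention is often realised as a squeeze-and-excitation style gate; a minimal sketch of one plausible form (an assumption based on the module's name, not the TVQE implementation):

```python
# Squeeze-and-excitation style channel attention: global average pool,
# bottleneck MLP (as 1x1 convs), sigmoid gate, channel reweighting.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),               # squeeze: (B, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                          # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x)                    # reweight channels

x = torch.randn(2, 32, 16, 16)
print(ChannelAttention(32)(x).shape)               # torch.Size([2, 32, 16, 16])
```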
arXiv Detail & Related papers (2022-10-25T08:12:05Z)
- Video Frame Interpolation with Transformer [55.12620857638253]
We introduce a novel framework, which takes advantage of Transformer to model long-range pixel correlation among video frames.
Our network is equipped with a novel cross-scale window-based attention mechanism, where cross-scale windows interact with each other.
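For context, plain (single-scale) window-based self-attention partitions the feature map into non-overlapping windows and attends within each; a PyTorch sketch, leaving out the cross-scale interaction this paper adds (names and partitioning are illustrative):

```python
# Window attention: reshape the feature map into (num_windows, tokens, dim)
# groups, attend within each window, then reverse the partition.
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    def __init__(self, dim=64, window=8, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        ws = self.window
        # Partition into (B * num_windows, ws*ws, C) token groups.
        x = x.view(b, c, h // ws, ws, w // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        x, _ = self.attn(x, x, x)                  # attention inside each window
        # Reverse the partition back to (B, C, H, W).
        x = x.view(b, h // ws, w // ws, ws, ws, c)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        return x

print(WindowAttention()(torch.randn(1, 64, 32, 32)).shape)  # (1, 64, 32, 32)
```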
arXiv Detail & Related papers (2022-05-15T09:30:28Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
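A toy sketch of the two stages named above, block-wise sampling as a learned strided convolution plus a naive linear recovery (illustrative only; CSformer's actual sampling and recovery modules differ):

```python
# Compressive sampling of 32x32 image blocks at a 10% measurement ratio,
# followed by a linear initial reconstruction.
import torch
import torch.nn as nn

block, ratio = 32, 0.1
m = int(ratio * block * block)                   # measurements per block (102)

sample = nn.Conv2d(1, m, kernel_size=block, stride=block, bias=False)
recover = nn.ConvTranspose2d(m, 1, kernel_size=block, stride=block, bias=False)

x = torch.randn(1, 1, 96, 96)                    # input image
y = sample(x)                                    # (1, m, 3, 3) measurements
x_hat = recover(y)                               # initial linear reconstruction
print(y.shape, x_hat.shape)                      # ... torch.Size([1, 1, 96, 96])
```

A deep network (here, the paper's transformer-based recovery) would then refine x_hat into the final reconstruction.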
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.