Cross-Attention Transformer for Video Interpolation
- URL: http://arxiv.org/abs/2207.04132v1
- Date: Fri, 8 Jul 2022 21:38:54 GMT
- Title: Cross-Attention Transformer for Video Interpolation
- Authors: Hannah Halin Kim, Shuzhi Yu, Shuai Yuan, Carlo Tomasi
- Abstract summary: TAIN (Transformers and Attention for video INterpolation) aims to interpolate an intermediate frame given two consecutive image frames around it.
We first present a novel visual transformer module, named Cross-Similarity (CS), to globally aggregate input image features with similar appearance as those of the predicted frame.
To account for occlusions in the CS features, we propose an Image Attention (IA) module to allow the network to focus on CS features from one frame over those of the other.
- Score: 3.5317804902980527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose TAIN (Transformers and Attention for video INterpolation), a
residual neural network for video interpolation, which aims to interpolate an
intermediate frame given two consecutive image frames around it. We first
present a novel visual transformer module, named Cross-Similarity (CS), to
globally aggregate input image features with similar appearance as those of the
predicted interpolated frame. These CS features are then used to refine the
interpolated prediction. To account for occlusions in the CS features, we
propose an Image Attention (IA) module to allow the network to focus on CS
features from one frame over those of the other. Additionally, we augment our
training dataset with an occluder patch that moves across frames to improve the
network's robustness to occlusions and large motion. Because existing methods
yield smooth predictions, especially near motion boundaries (MBs), we use an additional training
loss based on image gradient to yield sharper predictions. TAIN outperforms
existing methods that do not require flow estimation and performs comparably to
flow-based methods while being computationally efficient in terms of inference
time on Vimeo90k, UCF101, and SNU-FILM benchmarks.
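
The Cross-Similarity (CS) module is described only at a high level above. A minimal sketch of a cross-attention style aggregation in PyTorch, assuming features of a coarse interpolated prediction serve as queries and features of one input frame serve as keys and values (module name, projections, and shapes are illustrative assumptions, not the paper's implementation):

```python
# Hypothetical cross-similarity aggregation: query = predicted-frame features,
# key/value = input-frame features, so regions with similar appearance anywhere
# in the input frame contribute to refining each predicted location.
import torch
import torch.nn as nn


class CrossSimilarity(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = channels ** -0.5

    def forward(self, pred_feat: torch.Tensor, frame_feat: torch.Tensor) -> torch.Tensor:
        # pred_feat, frame_feat: (B, C, H, W)
        b, c, h, w = pred_feat.shape
        q = self.to_q(pred_feat).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.to_k(frame_feat).flatten(2)                  # (B, C, HW)
        v = self.to_v(frame_feat).flatten(2).transpose(1, 2)  # (B, HW, C)
        attn = torch.softmax(q @ k * self.scale, dim=-1)      # (B, HW, HW) similarity map
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)  # globally aggregated CS features
        return out
```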
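The Image Attention (IA) module is likewise only sketched in the abstract. One simple, hypothetical way to realize such per-frame preference is a learned gate that blends the CS features of the two input frames pixel by pixel, down-weighting the frame that is occluded at a given location; layer choices and names below are assumptions:

```python
# Hypothetical image-attention gate: a per-pixel weight in [0, 1] decides whether
# CS features from frame 1 or frame 2 dominate at each location.
import torch
import torch.nn as nn


class ImageAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, cs_feat1: torch.Tensor, cs_feat2: torch.Tensor) -> torch.Tensor:
        # cs_feat1, cs_feat2: (B, C, H, W) CS features from the two input frames
        w = self.gate(torch.cat([cs_feat1, cs_feat2], dim=1))  # (B, 1, H, W)
        return w * cs_feat1 + (1.0 - w) * cs_feat2
```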
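The occluder-patch augmentation can be pictured as pasting the same patch at slightly shifted locations in the two input frames, so it appears to move and occludes different content in each. A hedged sketch, where the patch source, size, and motion range are assumptions:

```python
# Hypothetical occluder augmentation: paste a patch into both frames with a random
# shift between them, simulating an object that moves and occludes the background.
import torch


def add_moving_occluder(frame0, frame1, patch, max_shift=16):
    # frame0, frame1: (C, H, W); patch: (C, ph, pw); assumes patch + shift fit in the frame
    _, h, w = frame0.shape
    _, ph, pw = patch.shape
    y = int(torch.randint(0, h - ph - max_shift, (1,)))
    x = int(torch.randint(0, w - pw - max_shift, (1,)))
    dy = int(torch.randint(0, max_shift + 1, (1,)))
    dx = int(torch.randint(0, max_shift + 1, (1,)))
    frame0, frame1 = frame0.clone(), frame1.clone()
    frame0[:, y:y + ph, x:x + pw] = patch
    frame1[:, y + dy:y + dy + ph, x + dx:x + dx + pw] = patch
    return frame0, frame1
```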
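The gradient-based training loss is not specified in detail in the abstract. A common formulation, given here as a sketch rather than the paper's exact loss, penalizes the L1 difference between spatial gradients of the prediction and the ground-truth frame so that motion boundaries stay sharp:

```python
# Hypothetical image-gradient loss: compare finite-difference gradients of the
# predicted and ground-truth frames to discourage over-smoothed predictions.
import torch


def gradient_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred, target: (B, C, H, W)
    def grads(img):
        dx = img[..., :, 1:] - img[..., :, :-1]   # horizontal differences
        dy = img[..., 1:, :] - img[..., :-1, :]   # vertical differences
        return dx, dy

    pdx, pdy = grads(pred)
    tdx, tdy = grads(target)
    return (pdx - tdx).abs().mean() + (pdy - tdy).abs().mean()
```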
Related papers
- ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation.
We introduce a novel, bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning.
Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z)
- Motion-Aware Video Frame Interpolation [49.49668436390514]
We introduce a Motion-Aware Video Frame Interpolation (MA-VFI) network, which directly estimates intermediate optical flow from consecutive frames.
It not only extracts global semantic relationships and spatial details from input frames with different receptive fields, but also effectively reduces the required computational cost and complexity.
arXiv Detail & Related papers (2024-02-05T11:00:14Z)
- Corner-to-Center Long-range Context Model for Efficient Learned Image Compression [70.0411436929495]
In the framework of learned image compression, the context model plays a pivotal role in capturing the dependencies among latent representations.
We propose the Corner-to-Center transformer-based Context Model (C$^3$M) designed to enhance context and latent predictions.
In addition, to enlarge the receptive field in the analysis and synthesis transformation, we use the Long-range Crossing Attention Module (LCAM) in the encoder/decoder.
arXiv Detail & Related papers (2023-11-29T21:40:28Z)
- Dynamic Frame Interpolation in Wavelet Domain [57.25341639095404]
Video frame interpolation is an important low-level computer vision task that can increase the frame rate for a more fluent visual experience.
Existing methods have achieved great success by employing advanced motion models and synthesis networks.
The proposed WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-07T06:41:15Z)
- IDO-VFI: Identifying Dynamics via Optical Flow Guidance for Video Frame Interpolation with Events [14.098949778274733]
Event cameras are ideal for capturing inter-frame dynamics with their extremely high temporal resolution.
We propose an event-and-frame-based video frame interpolation method named IDO-VFI that assigns varying amounts of computation to different sub-regions.
Our proposed method maintains high-quality performance while reducing computation time and computational effort by 10% and 17%, respectively, on the Vimeo90K dataset.
arXiv Detail & Related papers (2023-05-17T13:22:21Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
- RAI-Net: Range-Adaptive LiDAR Point Cloud Frame Interpolation Network [5.225160072036824]
LiDAR point cloud frame interpolation, which synthesizes the intermediate frame between the captured frames, has emerged as an important issue for many applications.
We propose a novel LiDAR point cloud frame interpolation method, which exploits range images (RIs) as an intermediate representation and uses CNNs to conduct the interpolation process.
Our method consistently achieves superior interpolation results with better perceptual quality than state-of-the-art video frame interpolation methods.
arXiv Detail & Related papers (2021-06-01T13:59:08Z)
- EA-Net: Edge-Aware Network for Flow-based Video Frame Interpolation [101.75999290175412]
We propose to reduce image blur and obtain clear object shapes by preserving edges in the interpolated frames.
The proposed Edge-Aware Network (EA-Net) integrates edge information into the frame interpolation task.
Three edge-aware mechanisms are developed to emphasize the frame edges in estimating flow maps.
arXiv Detail & Related papers (2021-05-17T08:44:34Z)
- Frame-rate Up-conversion Detection Based on Convolutional Neural Network for Learning Spatiotemporal Features [7.895528973776606]
This paper proposes a frame-rate up-conversion detection network (FCDNet) that learns forensic features caused by FRUC in an end-to-end fashion.
FCDNet takes a stack of consecutive frames as input and effectively learns interpolation artifacts through network blocks designed to capture spatiotemporal features.
arXiv Detail & Related papers (2021-03-25T08:47:46Z)
- Deep Learning for Robust Motion Segmentation with Non-Static Cameras [0.0]
This paper proposes a new end-to-end DCNN-based approach for motion segmentation, called MOSNET, aimed especially at scenes captured with non-static cameras.
While other approaches focus on spatial or temporal context, the proposed approach uses 3D convolutions as a key technology to factor in temporal features in video frames.
The network is able to perform well on scenes captured with non-static cameras where the image content changes significantly during the scene.
arXiv Detail & Related papers (2021-02-22T11:58:41Z)