CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow
- URL: http://arxiv.org/abs/2203.16896v1
- Date: Thu, 31 Mar 2022 09:05:00 GMT
- Title: CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow
- Authors: Xiuchao Sui, Shaohua Li, Xue Geng, Yan Wu, Xinxing Xu, Yong Liu, Rick
Goh, Hongyuan Zhu
- Abstract summary: Optical flow estimation aims to find the 2D motion field by identifying corresponding pixels between two images.
Despite the tremendous progress of deep learning-based optical flow methods, it remains a challenge to accurately estimate large displacements with motion blur.
This is mainly because the correlation volume, the basis of pixel matching, is computed as the dot product of the convolutional features of the two images.
We propose a new architecture "CRoss-Attentional Flow Transformer" (CRAFT) to revitalize the correlation volume computation.
- Score: 23.457898451057275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optical flow estimation aims to find the 2D motion field by identifying
corresponding pixels between two images. Despite the tremendous progress of
deep learning-based optical flow methods, it remains a challenge to accurately
estimate large displacements with motion blur. This is mainly because the
correlation volume, the basis of pixel matching, is computed as the dot product
of the convolutional features of the two images. The locality of convolutional
features makes the computed correlations susceptible to various noises. On
large displacements with motion blur, noisy correlations could cause severe
errors in the estimated flow. To overcome this challenge, we propose a new
architecture "CRoss-Attentional Flow Transformer" (CRAFT), aiming to revitalize
the correlation volume computation. In CRAFT, a Semantic Smoothing Transformer
layer transforms the features of one frame, making them more global and
semantically stable. In addition, the dot-product correlations are replaced
with transformer Cross-Frame Attention. This layer filters out feature noises
through the Query and Key projections, and computes more accurate correlations.
On Sintel (Final) and KITTI (foreground) benchmarks, CRAFT has achieved new
state-of-the-art performance. Moreover, to test the robustness of different
models on large motions, we designed an image shifting attack that shifts input
images to generate large artificial motions. Under this attack, CRAFT performs
much more robustly than two representative methods, RAFT and GMA. The code of
CRAFT is available at https://github.com/askerlee/craft.
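The two ideas in the abstract can be illustrated with a toy sketch: an all-pairs dot-product correlation volume (the RAFT-style baseline), the same correlation computed after Query/Key projections (standing in for CRAFT's Cross-Frame Attention), and an image-shifting setup in which frame 2 is frame 1 shifted by a fixed offset, so the ground-truth flow is known. The feature maps and the `W_q`/`W_k` matrices below are random placeholders rather than trained CNN or transformer weights; this is a minimal sketch of the computation path, not the CRAFT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "feature maps": H x W grids of C-dim features for two frames,
# L2-normalized so that the self-match correlation is exactly 1.
H, W, C = 8, 8, 64
feat1 = rng.standard_normal((H, W, C))
feat1 /= np.linalg.norm(feat1, axis=-1, keepdims=True)

# Image-shifting "attack": frame 2 is frame 1 shifted right by dx pixels,
# so the ground-truth flow is a constant (dx, 0) field.
dx = 3
feat2 = np.roll(feat1, shift=dx, axis=1)

def correlation_volume(f1, f2):
    """All-pairs dot-product correlation between two feature maps."""
    a = f1.reshape(-1, C)            # (H*W, C)
    b = f2.reshape(-1, C)            # (H*W, C)
    return a @ b.T / np.sqrt(C)      # (H*W, H*W)

# Plain dot-product correlation (the baseline CRAFT revises).
corr_dot = correlation_volume(feat1, feat2)

# Cross-frame-attention-style correlation: project the features with
# Query/Key matrices before the dot product. W_q / W_k are random
# placeholders here; in CRAFT they are trained transformer projections
# that filter out feature noise.
W_q = rng.standard_normal((C, C)) / np.sqrt(C)
W_k = rng.standard_normal((C, C)) / np.sqrt(C)
corr_attn = correlation_volume(feat1 @ W_q, feat2 @ W_k)

# For each pixel in frame 1, the best match in frame 2 implies a flow.
best = corr_dot.argmax(axis=1)
cols1 = np.tile(np.arange(W), H)
flow_x = (best % W) - cols1          # recovered horizontal flow (mod W)
print(np.all((flow_x % W) == dx))    # noise-free toy: matches the shift
```

On these clean synthetic features the plain dot-product argmax already recovers the injected shift; the paper's point is that on real convolutional features, which are local and noisy, the raw correlations degrade under large motion and blur, which is what the Query/Key projections are meant to counteract.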
Related papers
- ML-CrAIST: Multi-scale Low-high Frequency Information-based Cross black Attention with Image Super-resolving Transformer [3.686808512438363]
This work proposes a transformer-based super-resolution architecture called ML-CrAIST.
We operate spatial and channel self-attention, which concurrently model pixel interaction from both spatial and channel dimensions.
We devise a cross-attention block for super-resolution, which explores the correlations between low and high-frequency information.
arXiv Detail & Related papers (2024-08-19T12:23:15Z)
- WiNet: Wavelet-based Incremental Learning for Efficient Medical Image Registration [68.25711405944239]
Deep image registration has demonstrated exceptional accuracy and fast inference.
Recent advances have adopted either multiple cascades or pyramid architectures to estimate dense deformation fields in a coarse-to-fine manner.
We introduce a model-driven WiNet that incrementally estimates scale-wise wavelet coefficients for the displacement/velocity field across various scales.
arXiv Detail & Related papers (2024-07-18T11:51:01Z)
- Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring [71.60457491155451]
Eliminating image blur produced by various kinds of motion has been a challenging problem.
We propose a novel real-world deblurring filtering model called the Motion-adaptive Separable Collaborative Filter.
Our method provides an effective solution for real-world motion blur removal and achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-04-19T19:44:24Z)
- Look-Around Before You Leap: High-Frequency Injected Transformer for Image Restoration [46.96362010335177]
In this paper, we propose HIT, a simple yet effective High-frequency Injected Transformer for image restoration.
Specifically, we design a window-wise injection module (WIM), which incorporates abundant high-frequency details into the feature map, to provide reliable references for restoring high-quality images.
In addition, we introduce a spatial enhancement unit (SEU) to preserve essential spatial relationships that may be lost due to the computations carried out across channel dimensions in the BIM.
arXiv Detail & Related papers (2024-03-30T08:05:00Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- Dynamic Frame Interpolation in Wavelet Domain [57.25341639095404]
Video frame interpolation is an important low-level computer vision task, which can increase the frame rate for a more fluent visual experience.
Existing methods have achieved great success by employing advanced motion models and synthesis networks.
WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-07T06:41:15Z)
- Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network with a Semantics Consistent Transformer (SCTNet) with both spatial and channel attention modules.
arXiv Detail & Related papers (2023-05-29T15:03:23Z)
- CGCV: Context Guided Correlation Volume for Optical Flow Neural Networks [1.9226937205270165]
Correlation volume is the central component of optical flow computational neural models.
We propose a new Context Guided Correlation Volume (CGCV) via gating and lifting schemes.
CGCV can be universally integrated with RAFT-based flow computation methods for enhanced performance.
arXiv Detail & Related papers (2022-12-20T11:24:35Z)
- TFill: Image Completion via a Transformer-Based Architecture [69.62228639870114]
We propose treating image completion as a directionless sequence-to-sequence prediction task.
We employ a restrictive CNN with small and non-overlapping receptive fields (RF) for token representation.
In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced.
arXiv Detail & Related papers (2021-04-02T01:42:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.