Related papers: Aggregating Long-term Sharp Features via Hybrid Transformers for Video Deblurring

Aggregating Long-term Sharp Features via Hybrid Transformers for Video Deblurring

URL: http://arxiv.org/abs/2309.07054v1
Date: Wed, 13 Sep 2023 16:12:11 GMT
Title: Aggregating Long-term Sharp Features via Hybrid Transformers for Video Deblurring
Authors: Dongwei Ren, Wei Shang, Yi Yang and Wangmeng Zuo
Abstract summary: We propose a video deblurring method that leverages both neighboring frames and present sharp frames using hybrid Transformers for feature aggregation. Our proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in terms of quantitative metrics and visual quality.
Score: 76.54162653678871
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video deblurring methods, aiming at recovering consecutive sharp frames from a given blurry video, usually assume that the input video suffers from consecutively blurry frames. However, in real-world blurry videos taken by modern imaging devices, sharp frames usually appear in the given video, thus making temporal long-term sharp features available for facilitating the restoration of a blurry frame. In this work, we propose a video deblurring method that leverages both neighboring frames and present sharp frames using hybrid Transformers for feature aggregation. Specifically, we first train a blur-aware detector to distinguish between sharp and blurry frames. Then, a window-based local Transformer is employed for exploiting features from neighboring frames, where cross attention is beneficial for aggregating features from neighboring frames without explicit spatial alignment. To aggregate long-term sharp features from detected sharp frames, we utilize a global Transformer with multi-scale matching capability. Moreover, our method can easily be extended to event-driven video deblurring by incorporating an event fusion module into the global Transformer. Extensive experiments on benchmark datasets demonstrate that our proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in terms of quantitative metrics and visual quality. The source code and trained models are available at https://github.com/shangwei5/STGTN.

Related papers

ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation. We introduce a novel, bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning. Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z)
CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring [44.30048301161034]
Video deblurring aims to enhance the quality of restored results in motion-red videos by gathering information from adjacent video frames. We propose two modules: 1) Intra-frame feature enhancement operates within the exposure time of a single blurred frame, and 2) Inter-frame temporal feature alignment gathers valuable long-range temporal information to target frames. We demonstrate that our proposed methods outperform state-of-the-art frame-based and event-based motion deblurring methods through extensive experiments conducted on both synthetic and real-world deblurring datasets.
arXiv Detail & Related papers (2024-08-27T10:09:17Z)
Burstormer: Burst Image Restoration and Enhancement Transformer [117.56199661345993]
On a shutter press, modern handheld cameras capture multiple images in rapid succession and merge them to generate a single image. The challenge is to properly align the successive image shots and merge their complimentary information to achieve high-quality outputs. We propose Burstormer: a novel transformer-based architecture for burst image restoration and enhancement.
arXiv Detail & Related papers (2023-04-03T17:58:44Z)
SATVSR: Scenario Adaptive Transformer for Cross Scenarios Video Super-Resolution [0.0]
Video Super-Resolution aims to recover sequences of high-resolution (HR) frames from low-resolution (LR) frames. Previous methods mainly utilize temporally adjacent frames to assist the reconstruction of target frames. We devise a novel adaptive scenario video super-resolution method. Specifically, we use optical flow to label the patches in each video frame, only calculate the attention of patches with the same label. Then select the most relevant label among them to supplement the spatial-temporal information of the target frame.
arXiv Detail & Related papers (2022-11-16T06:30:13Z)
E-VFIA : Event-Based Video Frame Interpolation with Attention [8.93294761619288]
We propose an event-based video frame with attention (E-VFIA) as a lightweight kernel-based method. E-VFIA fuses event information with standard video frames by deformable convolutions to generate high quality interpolated frames. The proposed method represents events with high temporal resolution and uses a multi-head self-attention mechanism to better encode event-based information.
arXiv Detail & Related papers (2022-09-19T21:40:32Z)
TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation [50.49396123016185]
Video frame (VFI) aims to synthesize an intermediate frame between two consecutive frames. We propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI) Our method outperforms other state-of-the-art methods in four widely-used VFI benchmarks.
arXiv Detail & Related papers (2022-07-19T03:37:49Z)
VRT: A Video Restoration Transformer [126.79589717404863]
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. We propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities.
arXiv Detail & Related papers (2022-01-28T17:54:43Z)
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning [40.556222166309524]
We present SwinBERT, an end-to-end transformer-based model for video captioning. Our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input. Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames.
arXiv Detail & Related papers (2021-11-25T18:02:12Z)
Memory-Augmented Non-Local Attention for Video Super-Resolution [61.55700315062226]
We propose a novel video super-resolution method that aims at generating high-fidelity high-resolution (HR) videos from low-resolution (LR) ones. Previous methods predominantly leverage temporal neighbor frames to assist the super-resolution of the current frame. In contrast, we devise a cross-frame non-local attention mechanism that allows video super-resolution without frame alignment.
arXiv Detail & Related papers (2021-08-25T05:12:14Z)
ARVo: Learning All-Range Volumetric Correspondence for Video Deblurring [92.40655035360729]
Video deblurring models exploit consecutive frames to remove blurs from camera shakes and object motions. We propose a novel implicit method to learn spatial correspondence among blurry frames in the feature space. Our proposed method is evaluated on the widely-adopted DVD dataset, along with a newly collected High-Frame-Rate (1000 fps) dataset for Video Deblurring.
arXiv Detail & Related papers (2021-03-07T04:33:13Z)
Motion-blurred Video Interpolation and Extrapolation [72.3254384191509]
We present a novel framework for deblurring, interpolating and extrapolating sharp frames from a motion-blurred video in an end-to-end manner. To ensure temporal coherence across predicted frames and address potential temporal ambiguity, we propose a simple, yet effective flow-based rule.
arXiv Detail & Related papers (2021-03-04T12:18:25Z)
ALANET: Adaptive Latent Attention Network forJoint Video Deblurring and Interpolation [38.52446103418748]
We introduce a novel architecture, Adaptive Latent Attention Network (ALANET), which synthesizes sharp high frame-rate videos. We employ combination of self-attention and cross-attention module between consecutive frames in the latent space to generate optimized representation for each frame. Our method performs favorably against various state-of-the-art approaches, even though we tackle a much more difficult problem.
arXiv Detail & Related papers (2020-08-31T21:11:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.