Flow-Guided Sparse Transformer for Video Deblurring
- URL: http://arxiv.org/abs/2201.01893v1
- Date: Thu, 6 Jan 2022 02:05:32 GMT
- Title: Flow-Guided Sparse Transformer for Video Deblurring
- Authors: Jing Lin, Yuanhao Cai, Xiaowan Hu, Haoqian Wang, Youliang Yan, Xueyi
Zou, Henghui Ding, Yulun Zhang, Radu Timofte, Luc Van Gool
- Abstract summary: Flow-Guided Sparse Transformer (FGST) is a framework for video deblurring.
FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames.
Our proposed FGST outperforms state-of-the-art methods on both DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring.
- Score: 124.11022871999423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploiting similar and sharper scene patches in spatio-temporal neighborhoods
is critical for video deblurring. However, CNN-based methods show limitations
in capturing long-range dependencies and modeling non-local self-similarity. In
this paper, we propose a novel framework, Flow-Guided Sparse Transformer
(FGST), for video deblurring. In FGST, we customize a self-attention module,
Flow-Guided Sparse Window-based Multi-head Self-Attention (FGSW-MSA). For each
$query$ element on the blurry reference frame, FGSW-MSA enjoys the guidance of
the estimated optical flow to globally sample spatially sparse yet highly
related $key$ elements corresponding to the same scene patch in neighboring
frames. Besides, we present a Recurrent Embedding (RE) mechanism to transfer
information from past frames and strengthen long-range temporal dependencies.
Comprehensive experiments demonstrate that our proposed FGST outperforms
state-of-the-art (SOTA) methods on both DVD and GOPRO datasets and even yields
more visually pleasing results in real video deblurring. Code and models will
be released to the public.
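As a minimal sketch of the flow-guided sampling described in the abstract, the snippet below shows one plausible way to use estimated optical flow to gather spatially sparse key elements from neighboring frames for every query pixel of the blurry reference frame. It is an illustration under assumed interfaces, not the authors' implementation: the function name, the dense per-frame flow dictionary, and the fixed offset window are all assumptions made here for clarity.

```python
# Illustrative sketch only: flow-guided gathering of sparse keys for a
# reference-frame query grid. Interfaces and names are assumptions.
import torch
import torch.nn.functional as F

def flow_guided_keys(feats, flows, ref_idx, offsets):
    """Gather sparse key elements for every query pixel of the reference frame.

    feats:   (T, C, H, W) per-frame feature maps
    flows:   dict {t: (2, H, W)} estimated flow from the reference frame to frame t
    ref_idx: index of the blurry reference frame
    offsets: (K, 2) integer offsets defining the sparse sampling window
    returns: (H*W, K*(T-1), C) keys aligned to the same scene patch across frames
    """
    T, C, H, W = feats.shape
    # Base grid of query coordinates on the reference frame (pixel units).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float()            # (2, H, W), [x, y]

    keys = []
    for t in range(T):
        if t == ref_idx:
            continue
        # Warp query coordinates into frame t using the estimated flow.
        coords = base + flows[t]                            # (2, H, W)
        for dx, dy in offsets.tolist():
            shifted = coords + torch.tensor([dx, dy]).view(2, 1, 1).float()
            grid = shifted.permute(1, 2, 0).unsqueeze(0)    # (1, H, W, 2)
            # Normalize to [-1, 1] as expected by grid_sample (align_corners=True).
            grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1
            grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
            sampled = F.grid_sample(feats[t:t + 1], grid,
                                    align_corners=True)     # (1, C, H, W)
            keys.append(sampled[0].flatten(1).t())          # (H*W, C)
    return torch.stack(keys, dim=1)                         # (H*W, K*(T-1), C)

# Example with illustrative shapes: 5 frames, 32 channels, 64x64 features,
# and a 3x3 offset window around each flow-warped location.
if __name__ == "__main__":
    T, C, H, W = 5, 32, 64, 64
    feats = torch.randn(T, C, H, W)
    flows = {t: torch.randn(2, H, W) for t in range(T) if t != 2}
    offs = torch.tensor([[dx, dy] for dx in (-1, 0, 1) for dy in (-1, 0, 1)])
    keys = flow_guided_keys(feats, flows, ref_idx=2, offsets=offs)
    print(keys.shape)  # torch.Size([4096, 36, 32])
```

In FGST itself, these sampled keys would feed a window-based multi-head self-attention together with the reference-frame queries, and the Recurrent Embedding mechanism would additionally propagate features from past frames; neither step is shown in this sketch.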
Related papers
- SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition [18.542942459854867]
Large amounts of video samples are continuously required for traditional data-driven research.
We propose SOAP, a novel plug-and-play architecture for few-shot action recognition, in this paper.
SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51.
arXiv Detail & Related papers (2024-07-23T09:45:25Z) - Local Compressed Video Stream Learning for Generic Event Boundary
Detection [25.37983456118522]
Event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks.
Existing methods typically require video frames to be decoded before feeding into the network.
We propose a novel event boundary detection method that is fully end-to-end leveraging rich information in the compressed domain.
arXiv Detail & Related papers (2023-09-27T06:49:40Z) - Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM
Animator [59.589919015669274]
This study focuses on zero-shot text-to-video generation with an emphasis on data and cost efficiency.
We propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence.
We also propose a series of annotative modifications to adapt LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation.
arXiv Detail & Related papers (2023-09-25T19:42:16Z) - Aggregating Long-term Sharp Features via Hybrid Transformers for Video
Deblurring [76.54162653678871]
We propose a video deblurring method that leverages both neighboring frames and sharp frames present in the video, using hybrid Transformers for feature aggregation.
Our proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in terms of quantitative metrics and visual quality.
arXiv Detail & Related papers (2023-09-13T16:12:11Z) - UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for
Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z) - VRT: A Video Restoration Transformer [126.79589717404863]
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames.
We propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities.
arXiv Detail & Related papers (2022-01-28T17:54:43Z) - Leveraging Local Temporal Information for Multimodal Scene
Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to obtain contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z) - SF-Net: Single-Frame Supervision for Temporal Action Localization [60.202516362976645]
Single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead.
We propose a unified system called SF-Net to make use of such single-frame supervision.
SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization.
arXiv Detail & Related papers (2020-03-15T15:06:01Z)