Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel
Transformer
- URL: http://arxiv.org/abs/2205.00214v1
- Date: Sat, 30 Apr 2022 09:01:21 GMT
- Title: Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel
Transformer
- Authors: Wulian Yun, Mengshi Qi, Chuanming Wang, Huiyuan Fu, Huadong Ma
- Abstract summary: Video denoising aims to recover high-quality frames from the noisy video.
Most existing approaches adopt convolutional neural networks(CNNs) to separate the noise from the original visual content.
We propose a Dual-stage Spatial-Channel Transformer (DSCT) for coarse-to-fine video denoising.
- Score: 29.03463312813923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video denoising aims to recover high-quality frames from noisy videos.
While most existing approaches adopt convolutional neural networks (CNNs) to
separate the noise from the original visual content, CNNs focus on local
information and ignore the interactions between long-range regions.
Furthermore, most related works directly take the output after spatio-temporal
denoising as the final result, neglecting the fine-grained denoising process.
In this paper, we propose a Dual-stage Spatial-Channel Transformer (DSCT) for
coarse-to-fine video denoising, which inherits the advantages of both
Transformer and CNNs. Specifically, DSCT is built on a progressive dual-stage
architecture, namely a coarse level and a fine level, to extract dynamic and
static features, respectively. At both stages, a Spatial-Channel Encoding
Module (SCEM) is designed to model the long-range contextual dependencies at
the spatial and channel levels. Meanwhile, we design a Multi-scale Residual
Structure to preserve multiple aspects of information at different stages,
which contains a Temporal Features Aggregation Module (TFAM) to summarize the
dynamic representation. Extensive experiments on four publicly available
datasets demonstrate that our proposed DSCT achieves significant improvements
over state-of-the-art methods.
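To make the coarse-to-fine idea concrete, here is a minimal PyTorch sketch of how such a dual-stage denoiser could be organized: a coarse stage that attends over a stack of neighboring noisy frames and a fine stage that refines the resulting center-frame estimate, each using a spatial-attention-plus-channel-mixing encoder in the spirit of the SCEM. All module names, layer sizes, and the 5-frame input are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (PyTorch) of a coarse-to-fine, dual-stage video denoiser with a
# spatial-channel attention encoder. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class SpatialChannelEncoder(nn.Module):
    """Models long-range dependencies along the spatial and channel axes."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial_norm = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.channel_norm = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)              # (B, H*W, C) spatial tokens
        t = self.spatial_norm(tokens)
        tokens = tokens + self.spatial_attn(t, t, t, need_weights=False)[0]
        # Channel interaction: mix channels at every spatial location.
        tokens = tokens + self.channel_mlp(self.channel_norm(tokens))
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class CoarseToFineDenoiser(nn.Module):
    def __init__(self, frames=5, dim=64):
        super().__init__()
        self.coarse_in = nn.Conv2d(3 * frames, dim, 3, padding=1)   # stacked noisy clip
        self.coarse_enc = SpatialChannelEncoder(dim)
        self.coarse_out = nn.Conv2d(dim, 3, 3, padding=1)
        self.fine_in = nn.Conv2d(3, dim, 3, padding=1)               # coarse center-frame estimate
        self.fine_enc = SpatialChannelEncoder(dim)
        self.fine_out = nn.Conv2d(dim, 3, 3, padding=1)

    def forward(self, clip):                                # clip: (B, T, 3, H, W)
        b, t, c, h, w = clip.shape
        center = clip[:, t // 2]
        # Coarse stage: aggregate temporal (dynamic) information from the clip.
        coarse = center + self.coarse_out(self.coarse_enc(self.coarse_in(clip.reshape(b, t * c, h, w))))
        # Fine stage: refine the static content of the coarse estimate.
        return coarse + self.fine_out(self.fine_enc(self.fine_in(coarse)))


if __name__ == "__main__":
    model = CoarseToFineDenoiser()
    noisy = torch.randn(1, 5, 3, 32, 32)
    print(model(noisy).shape)                               # torch.Size([1, 3, 32, 32])
```

The residual connections around each stage reflect the abstract's emphasis on preserving multiple aspects of information across the two stages.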
Related papers
- IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation [136.5813547244979]
We present IDOL (unIfied Dual-mOdal Latent diffusion) for high-quality human-centric joint video-depth generation.
Our IDOL consists of two novel designs. First, to enable dual-modal generation, it maximizes the information exchange between video and depth generation.
Second, to ensure a precise video-depth spatial alignment, we propose a motion consistency loss that enforces consistency between the video and depth feature motion fields.
arXiv Detail & Related papers (2024-07-15T17:36:54Z)
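As a rough illustration of the motion consistency idea mentioned above, the sketch below treats a "feature motion field" as the frame-to-frame difference of intermediate features and penalizes the mismatch between the video and depth branches. The motion-field definition and the loss form are assumptions, not IDOL's actual formulation.

```python
# Hedged sketch of a motion-consistency-style loss (PyTorch). As an assumption, a
# motion field is taken to be the temporal difference of intermediate features.
import torch
import torch.nn.functional as F


def motion_field(feats: torch.Tensor) -> torch.Tensor:
    """feats: (B, T, C, H, W) -> temporal differences of shape (B, T-1, C, H, W)."""
    return feats[:, 1:] - feats[:, :-1]


def motion_consistency_loss(video_feats: torch.Tensor, depth_feats: torch.Tensor) -> torch.Tensor:
    """Encourage the feature motion fields of the two modalities to agree."""
    return F.mse_loss(motion_field(video_feats), motion_field(depth_feats))


if __name__ == "__main__":
    v = torch.randn(2, 8, 16, 32, 32, requires_grad=True)
    d = torch.randn(2, 8, 16, 32, 32, requires_grad=True)
    loss = motion_consistency_loss(v, d)
    loss.backward()
    print(loss.item())
```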
- Two-stage Progressive Residual Dense Attention Network for Image Denoising [0.680228754562676]
Many deep CNN-based denoising models equally utilize the hierarchical features of noisy images without paying attention to the more important and useful features, leading to relatively low performance.
We design a new Two-stage Progressive Residual Dense Attention Network (TSP-RDANet) for image denoising, which divides the whole denoising process into two sub-tasks to remove noise progressively.
Two different attention mechanism-based denoising networks are designed for the two sequential sub-tasks.
arXiv Detail & Related papers (2024-01-05T14:31:20Z)
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis.
Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
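The noise decomposition described for VideoFusion can be illustrated numerically: a base noise tensor shared by all frames is mixed with per-frame residual noise so that each frame's marginal noise stays standard Gaussian. The mixing weight below is an arbitrary illustrative choice, not a value from the paper.

```python
# Numerical sketch of decomposing per-frame noise into a shared base component and
# a per-frame residual. `lam` is an illustrative assumption.
import torch


def decomposed_noise(batch: int, frames: int, shape, lam: float = 0.5) -> torch.Tensor:
    base = torch.randn(batch, 1, *shape)              # shared among all frames
    residual = torch.randn(batch, frames, *shape)     # varies along the time axis
    return (lam ** 0.5) * base + ((1.0 - lam) ** 0.5) * residual


if __name__ == "__main__":
    eps = decomposed_noise(4, 16, (3, 32, 32), lam=0.5)
    print(eps.shape)            # torch.Size([4, 16, 3, 32, 32])
    print(eps.std().item())     # close to 1.0: per-frame marginals remain ~N(0, 1)
```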
- Multi-stage image denoising with the wavelet transform [125.2251438120701]
Deep convolutional neural networks (CNNs) are used for image denoising by automatically mining accurate structure information.
We propose a multi-stage image denoising CNN with the wavelet transform (MWDCNN) built from three stages, i.e., a dynamic convolutional block (DCB), two cascaded wavelet transform and enhancement blocks (WEBs), and a residual block (RB).
arXiv Detail & Related papers (2022-09-26T03:28:23Z)
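MWDCNN's wavelet transform and enhancement blocks are learned CNN stages; purely to show where such a stage operates, the sketch below runs a one-level 2D DWT, soft-thresholds the detail sub-bands as a simple stand-in for the learned enhancement, and inverts the transform. The wavelet choice and threshold value are assumptions.

```python
# Illustrative wavelet-domain denoising step (PyWavelets + NumPy). A soft threshold
# stands in for MWDCNN's learned enhancement blocks; it is not the paper's method.
import numpy as np
import pywt


def wavelet_stage(noisy: np.ndarray, wavelet: str = "haar", thr: float = 0.1) -> np.ndarray:
    cA, (cH, cV, cD) = pywt.dwt2(noisy, wavelet)            # one-level 2D DWT
    # Stand-in "enhancement": shrink the detail sub-bands.
    cH, cV, cD = (pywt.threshold(c, thr, mode="soft") for c in (cH, cV, cD))
    return pywt.idwt2((cA, (cH, cV, cD)), wavelet)          # back to the image domain


if __name__ == "__main__":
    clean = np.zeros((64, 64))
    clean[16:48, 16:48] = 1.0
    noisy = clean + 0.1 * np.random.randn(64, 64)
    denoised = wavelet_stage(noisy)
    print("noisy MAE:   ", np.abs(noisy - clean).mean())
    print("denoised MAE:", np.abs(denoised - clean).mean())
```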
- Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation [11.575821326313607]
We propose Video-TransUNet, a deep architecture for segmentation in medical CT videos constructed by integrating temporal feature blending into the TransUNet deep learning framework.
In particular, our approach amalgamates strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module, and reconstructive capabilities for multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads.
arXiv Detail & Related papers (2022-08-17T14:28:58Z)
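As a hedged picture of the multi-frame feature-blending step, the sketch below fuses per-frame backbone features into a single feature map with learned frame-wise attention weights. The weighting scheme and tensor shapes are assumptions for illustration, not the Video-TransUNet Temporal Context Module itself.

```python
# Hedged sketch (PyTorch) of temporal feature blending: per-frame features are
# combined with learned frame-wise weights before decoding. Shapes are assumed.
import torch
import torch.nn as nn


class TemporalBlend(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Scores each frame's relevance from globally pooled features.
        self.score = nn.Linear(channels, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W) -> blended feature map (B, C, H, W)
        pooled = feats.mean(dim=(-2, -1))                   # (B, T, C)
        weights = self.score(pooled).softmax(dim=1)         # (B, T, 1)
        return (weights.unsqueeze(-1).unsqueeze(-1) * feats).sum(dim=1)


if __name__ == "__main__":
    blend = TemporalBlend(channels=256)
    clip_feats = torch.randn(2, 4, 256, 28, 28)             # e.g. CNN backbone features
    print(blend(clip_feats).shape)                          # torch.Size([2, 256, 28, 28])
```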
- Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of the transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Practically, extensive experiments show that the resulting hierarchical multimodal transformer (HMT) surpasses most traditional, RNN-based, and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
- Temporal Distinct Representation Learning for Action Recognition [139.93983070642412]
Two-dimensional convolutional neural networks (2D CNNs) are commonly used to characterize videos.
Different frames of a video share the same 2D CNN kernels, which may result in repeated and redundant information utilization.
We propose a sequential channel filtering mechanism to excite the discriminative channels of features from different frames step by step, and thus avoid repeated information extraction.
Our method is evaluated on benchmark temporal reasoning datasets Something-Something V1 and V2, and it achieves visible improvements over the best competitor by 2.4% and 1.3%, respectively.
arXiv Detail & Related papers (2020-07-15T11:30:40Z)
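The sequential channel filtering idea above can be pictured as squeeze-and-excitation-style gating applied frame by frame, where channels already emphasized by earlier frames are progressively down-weighted so that each frame contributes something new. The gating and accumulation rules below are assumptions for illustration only.

```python
# Hedged sketch (PyTorch) of sequential channel filtering across frames. The exact
# gating and usage-accumulation rules are assumptions, not the paper's mechanism.
import torch
import torch.nn as nn


class SequentialChannelFilter(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W); process frames in temporal order.
        b, t, c, _, _ = feats.shape
        used = torch.zeros(b, c, device=feats.device)       # running channel usage
        outputs = []
        for i in range(t):
            g = self.gate(feats[:, i].mean(dim=(-2, -1)))   # (B, C) excitation weights
            g = g * (1.0 - used)                            # suppress already-used channels
            outputs.append(feats[:, i] * g.unsqueeze(-1).unsqueeze(-1))
            used = torch.clamp(used + g / t, max=1.0)       # accumulate channel usage
        return torch.stack(outputs, dim=1)                  # (B, T, C, H, W)


if __name__ == "__main__":
    scf = SequentialChannelFilter(channels=64)
    x = torch.randn(2, 8, 64, 14, 14)
    print(scf(x).shape)                                     # torch.Size([2, 8, 64, 14, 14])
```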
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.