Can SAM Boost Video Super-Resolution?
- URL: http://arxiv.org/abs/2305.06524v2
- Date: Fri, 12 May 2023 01:43:00 GMT
- Title: Can SAM Boost Video Super-Resolution?
- Authors: Zhihe Lu, Zeyu Xiao, Jiawang Bai, Zhiwei Xiong, Xinchao Wang
- Abstract summary: We propose a simple yet effective module -- SAM-guidEd refinEment Module (SEEM)
This lightweight plug-in module is specifically designed to leverage the attention mechanism to generate semantic-aware features.
We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
- Score: 78.29033914169025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The primary challenge in video super-resolution (VSR) is to handle large
motions in the input frames, which makes it difficult to accurately aggregate
information from multiple frames. Existing works either adopt deformable
convolutions or estimate optical flow as a prior to establish correspondences
between frames for effective alignment and fusion. However, they fail to take
into account the valuable semantic information that can greatly enhance
alignment and fusion, and flow-based methods rely heavily on the accuracy of a
flow estimation model, which may not provide precise flow given two
low-resolution frames.
In this paper, we investigate a more robust and semantic-aware prior for
enhanced VSR by utilizing the Segment Anything Model (SAM), a powerful
foundational model that is less susceptible to image degradation. To use the
SAM-based prior, we propose a simple yet effective module -- SAM-guidEd
refinEment Module (SEEM), which can enhance both alignment and fusion
procedures through the use of semantic information. This lightweight
plug-in module is specifically designed not only to leverage the attention
mechanism to generate semantic-aware features but also to be easily and
seamlessly integrated into existing methods. Concretely, we apply our SEEM to
two representative methods, EDVR and BasicVSR, resulting in consistently
improved performance with minimal implementation effort, on three widely used
VSR datasets: Vimeo-90K, REDS, and Vid4. More importantly, we find that the
proposed SEEM can advance the existing methods in an efficient tuning manner,
providing increased flexibility in adjusting the balance between performance
and the number of training parameters. Code will be open-sourced soon.
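The abstract describes SEEM as a lightweight, attention-based plug-in that refines aligned VSR features with SAM-derived semantic features. Below is a minimal PyTorch sketch of that idea; the module name, channel sizes, and the cross-attention formulation are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a SAM-guided refinement module in PyTorch.
# Assumes SAM masks/embeddings have already been extracted per frame and
# projected to `sem_ch` channels; names and shapes are illustrative only.
import torch
import torch.nn as nn


class SAMGuidedRefinement(nn.Module):
    """Refines aligned VSR features with attention over semantic features."""

    def __init__(self, feat_ch: int = 64, sem_ch: int = 256) -> None:
        super().__init__()
        self.q = nn.Conv2d(feat_ch, feat_ch, kernel_size=1)   # query from VSR features
        self.k = nn.Conv2d(sem_ch, feat_ch, kernel_size=1)    # key from SAM features
        self.v = nn.Conv2d(sem_ch, feat_ch, kernel_size=1)    # value from SAM features
        self.out = nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, sam_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        q = self.q(feat).flatten(2).transpose(1, 2)      # (B, HW, C)
        k = self.k(sam_feat).flatten(2)                  # (B, C, H'W')
        v = self.v(sam_feat).flatten(2).transpose(1, 2)  # (B, H'W', C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)   # semantic-aware attention
        refined = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        # Residual connection keeps the module a drop-in refinement step.
        return feat + self.out(refined)
```

In such a setup the block would sit after the alignment stage of a method like EDVR or BasicVSR, and tuning only its parameters while the backbone stays frozen is one way to trade performance against the number of trainable parameters, in the spirit of the efficient-tuning result mentioned above.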
Related papers
- Rapid and Power-Aware Learned Optimization for Modular Receive Beamforming [27.09017677987757]
Multiple-input multiple-output (MIMO) systems play a key role in wireless communication technologies.
We propose a power-oriented optimization algorithm for beamforming in modular hybrid systems.
We show how power-efficient beamforming can be encouraged by the learned optimizer via computation with low-resolution phase shifts.
arXiv Detail & Related papers (2024-08-01T10:19:25Z)
- Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation [7.797154022794006]
Recent endeavors regard the RGB modality as the center and the others as auxiliary, yielding an asymmetric architecture with two branches.
We propose a novel method, named MAGIC, that can be flexibly paired with various backbones, ranging from compact to high-performance models.
Our method achieves state-of-the-art performance while reducing the model parameters by 60%.
arXiv Detail & Related papers (2024-07-16T03:19:59Z)
- A Single Transformer for Scalable Vision-Language Modeling [74.05173379908703]
We present SOLO, a single transformer for visiOn-Language mOdeling.
A unified single Transformer architecture, like SOLO, effectively addresses scalability concerns in LVLMs.
In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM.
arXiv Detail & Related papers (2024-07-08T22:40:15Z)
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
- Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- Fast Online Video Super-Resolution with Deformable Attention Pyramid [172.16491820970646]
Video super-resolution (VSR) has many applications that pose strict causal, real-time, and latency constraints, including video streaming and TV.
We propose a recurrent VSR architecture based on a deformable attention pyramid (DAP).
arXiv Detail & Related papers (2022-02-03T17:49:04Z)
- Middle-level Fusion for Lightweight RGB-D Salient Object Detection [81.43951906434175]
A novel lightweight RGB-D SOD model is presented in this paper.
With IMFF and L modules incorporated in the middle-level fusion structure, our proposed model has only 3.9M parameters and runs at 33 FPS.
The experimental results on several benchmark datasets verify the effectiveness and superiority of the proposed method over some state-of-the-art methods.
arXiv Detail & Related papers (2021-04-23T11:37:15Z)
- Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing the convolution operation at each gate of the ConvLSTM with a depthwise separable convolution (a minimal sketch of this substitution appears after this list).
Our model outperforms the previous best accuracy on the larger and more challenging RWF-2000 dataset by more than a 2% margin.
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
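The SepConvLSTM entry above describes its gate construction concretely: each ConvLSTM gate convolution is replaced by a depthwise separable convolution. Below is a minimal PyTorch sketch of that substitution; the class names, channel arguments, and the single fused gate convolution are illustrative assumptions rather than the paper's released code.

```python
# Illustrative sketch: a depthwise separable convolution replacing the full
# convolution inside each ConvLSTM gate, as described for SepConvLSTM.
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (per-channel) followed by a 1x1 pointwise conv."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3) -> None:
        super().__init__()
        self.depthwise = nn.Conv2d(
            in_ch, in_ch, kernel_size, padding=kernel_size // 2, groups=in_ch
        )
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


class SepConvLSTMCell(nn.Module):
    """ConvLSTM cell whose gate convolutions use the separable variant."""

    def __init__(self, in_ch: int, hidden_ch: int) -> None:
        super().__init__()
        # One separable conv produces all four gates (i, f, o, g) at once.
        self.gates = DepthwiseSeparableConv(in_ch + hidden_ch, 4 * hidden_ch)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```

Factoring each gate convolution into a depthwise and a pointwise step is what makes the cell cheaper than a standard ConvLSTM, which is the efficiency argument behind the two-stream design.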
This list is automatically generated from the titles and abstracts of the papers in this site.