Semantic Lens: Instance-Centric Semantic Alignment for Video
Super-Resolution
- URL: http://arxiv.org/abs/2312.07823v4
- Date: Fri, 19 Jan 2024 12:18:28 GMT
- Title: Semantic Lens: Instance-Centric Semantic Alignment for Video
Super-Resolution
- Authors: Qi Tang, Yao Zhao, Meiqin Liu, Jian Jin, and Chao Yao
- Abstract summary: Inter-frame alignment is a critical cue for video super-resolution (VSR).
We introduce a novel paradigm for VSR named Semantic Lens.
Video is modeled as instances, events, and scenes via a Semantic Extractor.
- Score: 36.48329560039897
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As a critical cue for video super-resolution (VSR), inter-frame alignment
significantly impacts overall performance. However, accurate pixel-level
alignment is challenging due to the intricate interweaving of motion in
video. In response to this issue, we introduce a novel paradigm for VSR named
Semantic Lens, predicated on semantic priors drawn from degraded videos.
Specifically, video is modeled as instances, events, and scenes via a Semantic
Extractor. Those semantics assist the Pixel Enhancer in understanding the
recovered contents and generating more realistic visual results. The distilled
global semantics embody the scene information of each frame, while the
instance-specific semantics assemble the spatial-temporal contexts related to
each instance. Furthermore, we devise a Semantics-Powered Attention
Cross-Embedding (SPACE) block, composed of a Global Perspective Shifter (GPS)
and an Instance-Specific Semantic Embedding Encoder (ISEE), to bridge
pixel-level features with semantic knowledge. Concretely, the GPS module
generates pairs of affine transformation parameters for pixel-level feature
modulation conditioned on global semantics. After that, the ISEE module
harnesses the attention mechanism to align the adjacent frames in the
instance-centric semantic space. In addition, we incorporate a simple yet
effective pre-alignment module to alleviate the difficulty of model training.
Extensive experiments demonstrate the superiority of our model over existing
state-of-the-art VSR methods.
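To make the SPACE block concrete, below is a minimal PyTorch sketch of how GPS-style global modulation and ISEE-style instance-centric attention could fit together. The module names mirror the abstract, but all tensor shapes, layer choices (a linear affine head for GPS, single-head cross-attention for ISEE), and the residual wiring are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class GPS(nn.Module):
    """Global Perspective Shifter (sketch): predicts per-channel affine
    parameters from global scene semantics and modulates pixel-level
    features, in the spirit of FiLM/SFT-style conditioning."""

    def __init__(self, feat_ch: int, sem_dim: int):
        super().__init__()
        self.to_affine = nn.Linear(sem_dim, 2 * feat_ch)

    def forward(self, feat: torch.Tensor, global_sem: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) pixel features; global_sem: (B, sem_dim)
        gamma, beta = self.to_affine(global_sem).chunk(2, dim=1)
        gamma = gamma[:, :, None, None]  # broadcast affine params over H, W
        beta = beta[:, :, None, None]
        return feat * (1.0 + gamma) + beta


class ISEE(nn.Module):
    """Instance-Specific Semantic Embedding Encoder (sketch): aligns an
    adjacent frame to the current one via attention computed in an
    instance-conditioned space."""

    def __init__(self, feat_ch: int, inst_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(feat_ch + inst_dim, feat_ch)
        self.k_proj = nn.Linear(feat_ch + inst_dim, feat_ch)
        self.v_proj = nn.Linear(feat_ch, feat_ch)

    def forward(self, cur, adj, inst_sem):
        # cur, adj: (B, N, C) flattened pixel features of the current and
        # adjacent frames; inst_sem: (B, N, inst_dim) per-pixel instance
        # semantics (hypothetically broadcast from instance masks).
        q = self.q_proj(torch.cat([cur, inst_sem], dim=-1))
        k = self.k_proj(torch.cat([adj, inst_sem], dim=-1))
        v = self.v_proj(adj)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return cur + attn @ v  # residual fusion of aligned adjacent content
```

Under this reading, a SPACE block would first apply GPS to shift pixel features toward the global scene semantics, then flatten the modulated features and run ISEE against each adjacent frame, matching the order described in the abstract.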
Related papers
- SMC++: Masked Learning of Unsupervised Video Semantic Compression [54.62883091552163]
We propose a Masked Video Modeling (MVM)-powered compression framework that specifically preserves video semantics.
MVM is proficient at learning generalizable semantics through the masked patch prediction task.
However, it may also encode non-semantic information such as trivial texture details, wasting bit cost and introducing semantic noise.
arXiv Detail & Related papers (2024-06-07T09:06:40Z) - Global Motion Understanding in Large-Scale Video Object Segmentation [0.499320937849508]
We show that transferring knowledge from other domains of video understanding, combined with large-scale learning, can improve the robustness of Video Object Segmentation (VOS) under complex circumstances.
Namely, we focus on integrating scene global motion knowledge to improve large-scale semi-supervised Video Object Segmentation.
We present WarpFormer, an architecture for semi-supervised Video Object Segmentation that exploits existing knowledge in motion understanding to conduct smoother propagation and more accurate matching.
arXiv Detail & Related papers (2024-05-11T15:09:22Z) - Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition [16.828560953073495]
We propose a novel "Align before Adapt" (ALT) paradigm for video representation learning.
We exploit entity-to-region alignments for each frame. The alignments are achieved by matching region-aware image embeddings to an offline-constructed text corpus.
ALT demonstrates competitive performance while maintaining remarkably low computational costs.
arXiv Detail & Related papers (2023-11-27T08:32:28Z) - Rethinking Amodal Video Segmentation from Learning Supervised Signals
with Object-centric Representation [47.39455910191075]
Video amodal segmentation is a challenging task in computer vision.
Recent studies have achieved promising performance by using motion flow to integrate information across frames under a self-supervised setting.
This paper presents a rethinking of previous works, particularly leveraging supervised signals with an object-centric representation.
arXiv Detail & Related papers (2023-09-23T04:12:02Z) - SOC: Semantic-Assisted Object Cluster for Referring Video Object
Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z) - Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot
Learning [74.48337375174297]
Generalized Zero-Shot Learning (GZSL) identifies unseen categories using knowledge transferred from the seen domain.
We deploy a dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between prototypes and visual features.
DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes adapted to different images, enabling unmatched semantic-visual pairs to be recast as matched ones.
arXiv Detail & Related papers (2023-03-27T15:21:43Z) - SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily
Oriented Scene Text Recognition [26.571128345615108]
We propose a novel Semantic GAN and Balanced Attention Network (SGBANet) to recognize text in scene images.
The proposed method first generates a simple semantic feature using the Semantic GAN and then recognizes the scene text with the Balanced Attention Module.
arXiv Detail & Related papers (2022-07-21T01:41:53Z) - Semantic-shape Adaptive Feature Modulation for Semantic Image Synthesis [71.56830815617553]
A fine-grained part-level semantic layout benefits the generation of object details.
A Shape-aware Position Descriptor (SPD) is proposed to describe each pixel's positional feature.
A Semantic-shape Adaptive Feature Modulation (SAFM) block is proposed to combine the given semantic map and our positional features.
arXiv Detail & Related papers (2022-03-31T09:06:04Z) - In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z) - Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene
Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)