Semantic Lens: Instance-Centric Semantic Alignment for Video
Super-Resolution
- URL: http://arxiv.org/abs/2312.07823v4
- Date: Fri, 19 Jan 2024 12:18:28 GMT
- Title: Semantic Lens: Instance-Centric Semantic Alignment for Video
Super-Resolution
- Authors: Qi Tang, Yao Zhao, Meiqin Liu, Jian Jin, and Chao Yao
- Abstract summary: Inter-frame alignment is a critical clue of video super-resolution (VSR).
We introduce a novel paradigm for VSR named Semantic Lens.
Video is modeled as instances, events, and scenes via a Semantic Extractor.
- Score: 36.48329560039897
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As a critical clue of video super-resolution (VSR), inter-frame alignment
significantly impacts overall performance. However, accurate pixel-level
alignment is a challenging task due to the intricate motion interweaving in the
video. In response to this issue, we introduce a novel paradigm for VSR named
Semantic Lens, predicated on semantic priors drawn from degraded videos.
Specifically, video is modeled as instances, events, and scenes via a Semantic
Extractor. Those semantics assist the Pixel Enhancer in understanding the
recovered contents and generating more realistic visual results. The distilled
global semantics embody the scene information of each frame, while the
instance-specific semantics assemble the spatial-temporal contexts related to
each instance. Furthermore, we devise a Semantics-Powered Attention
Cross-Embedding (SPACE) block to bridge the pixel-level features with semantic
knowledge, composed of a Global Perspective Shifter (GPS) and an
Instance-Specific Semantic Embedding Encoder (ISEE). Concretely, the GPS module
generates pairs of affine transformation parameters for pixel-level feature
modulation conditioned on global semantics. After that, the ISEE module
harnesses the attention mechanism to align the adjacent frames in the
instance-centric semantic space. In addition, we incorporate a simple yet
effective pre-alignment module to alleviate the difficulty of model training.
Extensive experiments demonstrate the superiority of our model over existing
state-of-the-art VSR methods.
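To make the pipeline described above more concrete, here is a minimal PyTorch sketch of a SPACE-style block: a GPS-style module maps the global scene semantics to per-channel affine (scale, shift) parameters that modulate pixel-level features, and an ISEE-style module applies multi-head cross-attention to align a neighboring frame with the current frame in an instance-centric semantic space. All class names, tensor layouts, and dimensions are illustrative assumptions and do not reproduce the authors' implementation.

```python
# Minimal sketch of a SPACE-style block (GPS + ISEE) as described in the abstract.
# Names, shapes, and the exact attention layout are assumptions for illustration only.
import torch
import torch.nn as nn


class GlobalPerspectiveShifter(nn.Module):
    """GPS: affine modulation of pixel features conditioned on global semantics."""

    def __init__(self, feat_dim: int, sem_dim: int):
        super().__init__()
        # Project the global semantic vector to a (gamma, beta) pair per channel.
        self.to_affine = nn.Linear(sem_dim, 2 * feat_dim)

    def forward(self, feat: torch.Tensor, global_sem: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W), global_sem: (B, sem_dim)
        gamma, beta = self.to_affine(global_sem).chunk(2, dim=-1)
        gamma = gamma[..., None, None]  # (B, C, 1, 1)
        beta = beta[..., None, None]
        return feat * (1.0 + gamma) + beta


class InstanceSemanticEmbeddingEncoder(nn.Module):
    """ISEE: cross-attention aligning neighbor-frame features via instance semantics."""

    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, cur_feat, nbr_feat, inst_sem):
        # cur_feat/nbr_feat: (B, C, H, W); inst_sem: (B, N_instances, C),
        # assumed already projected to the feature dimension.
        b, c, h, w = cur_feat.shape
        q = cur_feat.flatten(2).transpose(1, 2)  # (B, HW, C) queries from the current frame
        kv = torch.cat([nbr_feat.flatten(2).transpose(1, 2), inst_sem], dim=1)
        aligned, _ = self.attn(q, kv, kv)        # attend to neighbor pixels and instance tokens
        return aligned.transpose(1, 2).reshape(b, c, h, w)


class SPACEBlock(nn.Module):
    """SPACE: GPS modulation followed by instance-centric attention alignment."""

    def __init__(self, feat_dim: int = 64, sem_dim: int = 256):
        super().__init__()
        self.gps = GlobalPerspectiveShifter(feat_dim, sem_dim)
        self.isee = InstanceSemanticEmbeddingEncoder(feat_dim)

    def forward(self, cur_feat, nbr_feat, global_sem, inst_sem):
        cur_mod = self.gps(cur_feat, global_sem)
        nbr_mod = self.gps(nbr_feat, global_sem)
        return self.isee(cur_mod, nbr_mod, inst_sem)


# Example usage with toy tensors (shapes are illustrative):
block = SPACEBlock(feat_dim=64, sem_dim=256)
cur = torch.randn(1, 64, 32, 32)
nbr = torch.randn(1, 64, 32, 32)
g_sem = torch.randn(1, 256)
i_sem = torch.randn(1, 5, 64)  # 5 detected instances, projected to the feature dimension
out = block(cur, nbr, g_sem, i_sem)  # (1, 64, 32, 32)
```

In a full VSR pipeline, the pre-aligned neighbor features and the semantics produced by the Semantic Extractor would feed a block like this before the Pixel Enhancer reconstructs the high-resolution frame.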
Related papers
- CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders [6.159948396712944]
CrossVideoMAE learns both video-level and frame-level rich temporal representations and semantic attributes.
Our method integrates mutual temporal information from videos with spatial information from sampled frames.
This is critical for acquiring rich, label-free guiding signals from both video and frame image modalities in a self-supervised manner.
arXiv Detail & Related papers (2025-02-08T06:15:39Z)
- Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation [52.337472185022136]
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description.
We propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation.
We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art consistency.
arXiv Detail & Related papers (2025-01-06T14:49:26Z)
- Towards Open-Vocabulary Video Semantic Segmentation [40.58291642595943]
We introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) task, designed to accurately segment every pixel across a wide range of open-vocabulary categories.
To enhance OV-VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module.
Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the video context.
arXiv Detail & Related papers (2024-12-12T14:53:16Z)
- SMC++: Masked Learning of Unsupervised Video Semantic Compression [54.62883091552163]
We propose a Masked Video Modeling (MVM)-powered compression framework that particularly preserves video semantics.
MVM is proficient at learning generalizable semantics through the masked patch prediction task.
However, it may also encode non-semantic information such as trivial textural details, wasting bit cost and introducing semantic noise.
arXiv Detail & Related papers (2024-06-07T09:06:40Z)
- Global Motion Understanding in Large-Scale Video Object Segmentation [0.499320937849508]
We show that transferring knowledge from other domains of video understanding, combined with large-scale learning, can improve the robustness of Video Object Segmentation (VOS) under complex circumstances.
Namely, we focus on integrating scene global motion knowledge to improve large-scale semi-supervised Video Object Segmentation.
We present WarpFormer, an architecture for semi-supervised Video Object Segmentation that exploits existing knowledge in motion understanding to conduct smoother propagation and more accurate matching.
arXiv Detail & Related papers (2024-05-11T15:09:22Z)
- SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z)
- SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily Oriented Scene Text Recognition [26.571128345615108]
We propose a novel Semantic GAN and Balanced Attention Network (SGBANet) to recognize the texts in scene images.
The proposed method first generates the simple semantic feature using Semantic GAN and then recognizes the scene text with the Balanced Attention Module.
arXiv Detail & Related papers (2022-07-21T01:41:53Z)
- Semantic-shape Adaptive Feature Modulation for Semantic Image Synthesis [71.56830815617553]
A fine-grained part-level semantic layout benefits the generation of object details.
A Shape-aware Position Descriptor (SPD) is proposed to describe each pixel's positional feature.
A Semantic-shape Adaptive Feature Modulation (SAFM) block is proposed to combine the given semantic map and our positional features.
arXiv Detail & Related papers (2022-03-31T09:06:04Z)
- In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.