Coarse to Fine Multi-Resolution Temporal Convolutional Network
- URL: http://arxiv.org/abs/2105.10859v1
- Date: Sun, 23 May 2021 06:07:40 GMT
- Title: Coarse to Fine Multi-Resolution Temporal Convolutional Network
- Authors: Dipika Singhania, Rahul Rahaman, Angela Yao
- Abstract summary: We propose a novel temporal encoder-decoder to tackle the problem of sequence fragmentation.
The decoder follows a coarse-to-fine structure with an implicit ensemble of multiple temporal resolutions.
Experiments show that our stand-alone architecture, together with our novel feature-augmentation strategy and new loss, outperforms the state-of-the-art on three temporal video segmentation benchmarks.
- Score: 25.08516972520265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal convolutional networks (TCNs) are a commonly used architecture for
temporal video segmentation. TCNs, however, tend to suffer from
over-segmentation errors and require additional refinement modules to ensure
smoothness and temporal coherency. In this work, we propose a novel temporal
encoder-decoder to tackle the problem of sequence fragmentation. In particular,
the decoder follows a coarse-to-fine structure with an implicit ensemble of
multiple temporal resolutions. The ensembling produces smoother segmentations
that are more accurate and better-calibrated, bypassing the need for additional
refinement modules. In addition, we enhance our training with a
multi-resolution feature-augmentation strategy to promote robustness to varying
temporal resolutions. Finally, to support our architecture and encourage
further sequence coherency, we propose an action loss that penalizes
misclassifications at the video level. Experiments show that our stand-alone
architecture, together with our novel feature-augmentation strategy and new
loss, outperforms the state-of-the-art on three temporal video segmentation
benchmarks.
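
The abstract does not spell out how the decoder's "implicit ensemble of multiple temporal resolutions" is computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each decoder layer emits per-frame class probabilities at its own temporal resolution, that coarser outputs are stretched to full length with nearest-neighbour upsampling, and that the layers are combined by a weighted average. The function names, the uniform weights, and the toy numbers are all hypothetical.

```python
# Illustrative sketch only: the function names, uniform weights, and
# nearest-neighbour upsampling are assumptions, not the paper's method.

def upsample_nearest(probs, target_len):
    """Stretch a [T_coarse][C] list of class probabilities to target_len frames."""
    src_len = len(probs)
    return [probs[min(t * src_len // target_len, src_len - 1)]
            for t in range(target_len)]


def ensemble_predictions(layer_probs, target_len, weights=None):
    """Weighted average of per-layer probabilities, each upsampled to target_len."""
    if weights is None:
        # Assumption: every resolution contributes equally.
        weights = [1.0 / len(layer_probs)] * len(layer_probs)
    num_classes = len(layer_probs[0][0])
    fused = [[0.0] * num_classes for _ in range(target_len)]
    for w, probs in zip(weights, layer_probs):
        up = upsample_nearest(probs, target_len)
        for t in range(target_len):
            for c in range(num_classes):
                fused[t][c] += w * up[t][c]
    return fused


def argmax_labels(frame_probs):
    """Pick the most probable class index for every frame."""
    return [max(range(len(row)), key=row.__getitem__) for row in frame_probs]


# Toy data: 8 frames, 2 classes, three decoder resolutions (2, 4, and 8 frames).
coarse = [[0.9, 0.1], [0.2, 0.8]]
medium = [[0.8, 0.2], [0.7, 0.3], [0.3, 0.7], [0.1, 0.9]]
fine = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.6, 0.4],
        [0.3, 0.7], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]]

fused = ensemble_predictions([coarse, medium, fine], target_len=8)
fine_labels = argmax_labels(fine)    # [0, 1, 0, 0, 1, 0, 1, 1]: fragmented
fused_labels = argmax_labels(fused)  # [0, 0, 0, 0, 1, 1, 1, 1]: two clean segments
```

In this toy case the finest layer alone flips labels at isolated frames, while the average over resolutions yields two contiguous segments, which is the kind of smoothing effect the abstract attributes to the ensemble.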
Related papers
- Low-Light Video Enhancement via Spatial-Temporal Consistent Illumination and Reflection Decomposition [68.6707284662443]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic and static scenes plagued by severe invisibility and noise.
One critical aspect is formulating a consistency constraint specifically for temporal-spatial illumination and appearance enhanced versions.
We present an innovative video Retinex-based decomposition strategy that operates without the need for explicit supervision.
arXiv Detail & Related papers (2024-05-24T15:56:40Z)
- Differentiable Resolution Compression and Alignment for Efficient Video Classification and Retrieval [16.497758750494537]
We propose an efficient video representation network with Differentiable Resolution Compression and Alignment mechanism.
We leverage a Differentiable Context-aware Compression Module to encode the saliency and non-saliency frame features.
We introduce a new Resolution-Align Transformer Layer to capture global temporal correlations among frame features with different resolutions.
arXiv Detail & Related papers (2023-09-15T05:31:53Z)
- Cross-Consistent Deep Unfolding Network for Adaptive All-In-One Video Restoration [78.14941737723501]
We propose a Cross-consistent Deep Unfolding Network (CDUN) for All-In-One VR.
By orchestrating two cascading procedures, CDUN achieves adaptive processing for diverse degradations.
In addition, we introduce a window-based inter-frame fusion strategy to utilize information from more adjacent frames.
arXiv Detail & Related papers (2023-09-04T14:18:00Z)
- Continuous Space-Time Video Super-Resolution Utilizing Long-Range Temporal Information [48.20843501171717]
We propose a continuous ST-VSR (CSTVSR) method that can convert the given video to any frame rate and spatial resolution.
We show that the proposed algorithm has good flexibility and achieves better performance on various datasets.
arXiv Detail & Related papers (2023-02-26T08:02:39Z)
- Temporal Consistency Learning of inter-frames for Video Super-Resolution [38.26035126565062]
Video super-resolution (VSR) is a task that aims to reconstruct high-resolution (HR) frames from the low-resolution (LR) reference frame and multiple neighboring frames.
Existing methods generally explore information propagation and frame alignment to improve the performance of VSR.
We propose a Temporal Consistency learning Network (TCNet) for VSR in an end-to-end manner, to enhance the consistency of the reconstructed videos.
arXiv Detail & Related papers (2022-11-03T08:23:57Z)
- Distortion-Aware Network Pruning and Feature Reuse for Real-time Video Segmentation [49.17930380106643]
We propose a novel framework to speed up any architecture with skip-connections for real-time vision tasks.
Specifically, at the arrival of each frame, we transform the features from the previous frame to reuse them at specific spatial bins.
We then perform partial computation of the backbone network on the regions of the current frame that capture temporal differences between the current and previous frame.
arXiv Detail & Related papers (2022-06-20T07:20:02Z)
- Revisiting Temporal Alignment for Video Restoration [39.05100686559188]
Long-range temporal alignment is critical yet challenging for video restoration tasks.
We present a novel, generic iterative alignment module which employs a gradual refinement scheme for sub-alignments.
Our model achieves state-of-the-art performance on multiple benchmarks across a range of video restoration tasks.
arXiv Detail & Related papers (2021-11-30T11:08:52Z)
- An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement [132.60976158877608]
We propose an efficient adversarial video enhancement framework that learns directly from unpaired video examples.
In particular, our framework introduces new recurrent cells that consist of interleaved local and global modules for implicit integration of spatial and temporal information.
The proposed design allows our recurrent cells to efficiently propagate temporal information across frames and reduces the need for high complexity networks.
arXiv Detail & Related papers (2020-12-24T00:03:29Z)
- iSeeBetter: Spatio-temporal video super-resolution using recurrent generative back-projection networks [0.0]
We present iSeeBetter, a novel GAN-based structural-temporal approach to video super-resolution (VSR).
iSeeBetter extracts spatial and temporal information from the current and neighboring frames using the concept of recurrent back-projection networks as its generator.
Our results demonstrate that iSeeBetter offers superior VSR fidelity and surpasses state-of-the-art performance.
arXiv Detail & Related papers (2020-06-13T01:36:30Z)
- Temporally Distributed Networks for Fast Video Semantic Segmentation [64.5330491940425]
TDNet is a temporally distributed network designed for fast and accurate video semantic segmentation.
We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks.
Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly faster speed and lower latency.
arXiv Detail & Related papers (2020-04-03T22:43:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.