Space-Time Attention with Shifted Non-Local Search
- URL: http://arxiv.org/abs/2309.16849v2
- Date: Mon, 4 Dec 2023 16:44:45 GMT
- Title: Space-Time Attention with Shifted Non-Local Search
- Authors: Kent Gauen and Stanley Chan
- Abstract summary: Methods for long-range motion use an auxiliary network to predict the most similar key coordinates as offsets from each query location.
Small spatial inaccuracies significantly impact the attention module's quality.
This paper proposes a search strategy that combines the quality of a non-local search with the range of predicted offsets.
- Score: 1.7676816383911753
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficiently computing attention maps for videos is challenging due to the
motion of objects between frames. While a standard non-local search is
high-quality for a window surrounding each query point, the window's small size
cannot accommodate motion. Methods for long-range motion use an auxiliary
network to predict the most similar key coordinates as offsets from each query
location. However, accurately predicting this flow field of offsets remains
challenging, even for large-scale networks. Small spatial inaccuracies
significantly impact the attention module's quality. This paper proposes a
search strategy that combines the quality of a non-local search with the range
of predicted offsets. The method, named Shifted Non-Local Search, executes a
small grid search surrounding the predicted offsets to correct small spatial
errors. Our method's in-place computation consumes 10 times less memory and is
over 3 times faster than previous work. Experimentally, correcting the small
spatial errors improves the video frame alignment quality by over 3 dB PSNR.
Our search upgrades existing space-time attention modules, which improves video
denoising results by 0.30 dB PSNR for a 7.5% increase in overall runtime. We
integrate our space-time attention module into a UNet-like architecture to
achieve state-of-the-art results on video denoising.
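The core idea of the abstract above can be sketched as follows: given a predicted flow offset per query pixel, run a small grid search around the offset's landing point and keep the most similar key location. This is a minimal illustration only; the function name, feature shapes, dot-product similarity, and border clamping are assumptions, not the paper's actual (in-place, CUDA-accelerated) implementation.

```python
import numpy as np

def shifted_non_local_search(query_feats, key_feats, offsets, radius=1):
    """Sketch of a shifted non-local search (illustrative, not the paper's code).

    query_feats: (H, W, C) features from the query frame
    key_feats:   (H, W, C) features from the key frame
    offsets:     (H, W, 2) predicted (dy, dx) flow offsets per query pixel
    radius:      half-width of the small grid searched around each offset
    Returns corrected offsets of shape (H, W, 2).
    """
    H, W, _ = query_feats.shape
    corrected = np.zeros_like(offsets)
    for y in range(H):
        for x in range(W):
            q = query_feats[y, x]
            # Landing point of the predicted offset in the key frame.
            base_y = y + int(offsets[y, x, 0])
            base_x = x + int(offsets[y, x, 1])
            best, best_off = -np.inf, (offsets[y, x, 0], offsets[y, x, 1])
            # A small grid search around the predicted offset corrects
            # small spatial errors in the predicted flow field.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ky = min(max(base_y + dy, 0), H - 1)  # clamp to frame
                    kx = min(max(base_x + dx, 0), W - 1)
                    score = float(q @ key_feats[ky, kx])  # dot-product similarity
                    if score > best:
                        best = score
                        best_off = (ky - y, kx - x)
            corrected[y, x] = best_off
    return corrected
```

Even with `radius=1`, a search window of this kind can recover a one-pixel error in the predicted flow, which is the kind of small spatial inaccuracy the abstract says dominates attention quality.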
Related papers
- Fast graph-based denoising for point cloud color information [29.126056066012264]
In some applications, e.g., for live streaming using a 3D point cloud, real-time point cloud denoising methods are required to enhance the visual quality.
This paper proposes a fast graph-based denoising (FGBD) for a large-scale point cloud.
In our experiments, we succeeded in reducing the processing time dramatically while maintaining accuracy relative to conventional denoising methods.
arXiv Detail & Related papers (2024-01-18T04:51:41Z)
- No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection [52.03562682785128]
Temporal video grounding aims to retrieve the time interval of a language query from an untrimmed video.
A significant challenge in TVG is the low "Semantic Noise Ratio (SNR)": performance degrades as the SNR decreases.
We propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection.
arXiv Detail & Related papers (2023-07-20T04:12:10Z)
- Pseudo-Inverted Bottleneck Convolution for DARTS Search Space [35.50068534514941]
We introduce the Pseudo-Inverted Bottleneck (PIBConv) block intending to reduce the computational footprint of the inverted bottleneck block proposed in ConvNeXt.
Our proposed architecture is much less sensitive to evaluation layer count and significantly outperforms a DARTS network of similar size, even at layer counts as small as 2.
arXiv Detail & Related papers (2022-12-31T22:56:04Z)
- Video-based Person Re-identification without Bells and Whistles [49.51670583977911]
Video-based person re-identification (Re-ID) aims to match tracklets of cropped video frames in order to identify pedestrians across different cameras.
These cropped tracklets suffer severe spatial and temporal misalignment due to imperfect detection and tracking results generated by obsolete methods.
We present a simple re-Detect and Link (DL) module that effectively reduces this unexpected noise by applying deep learning-based detection and tracking to the cropped tracklets.
arXiv Detail & Related papers (2021-05-22T10:17:38Z)
- ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search [94.90294600817215]
We propose a novel neural architecture search (NAS) method, termed ViPNAS, to search networks in both spatial and temporal levels for fast online video pose estimation.
In the spatial level, we carefully design the search space with five different dimensions including network depth, width, kernel size, group number, and attentions.
In the temporal level, we search from a series of temporal feature fusions to optimize the total accuracy and speed across multiple video frames.
arXiv Detail & Related papers (2021-05-21T06:36:40Z)
- Stagnation Detection in Highly Multimodal Fitness Landscapes [0.0]
Stagnation detection has been proposed as a mechanism for randomized search heuristics to escape from local optima.
In this paper, we investigate a new mechanism called radius memory which can be added to stagnation detection to control the search radius more carefully.
We implement this idea in an algorithm called SD-RLS$^{m}$ and show that, compared to previous variants of stagnation detection, it yields speed-ups.
arXiv Detail & Related papers (2021-04-09T14:33:52Z)
- ISTA-NAS: Efficient and Consistent Neural Architecture Search by Sparse Coding [86.40042104698792]
We formulate neural architecture search as a sparse coding problem.
In experiments, our two-stage method on CIFAR-10 requires only 0.05 GPU-day for search.
Our one-stage method produces state-of-the-art performances on both CIFAR-10 and ImageNet at the cost of only evaluation time.
arXiv Detail & Related papers (2020-10-13T04:34:24Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
arXiv Detail & Related papers (2020-05-18T02:53:52Z)
- Expedited Multi-Target Search with Guaranteed Performance via Multi-fidelity Gaussian Processes [9.434133337939496]
We consider a scenario in which an autonomous vehicle operates in a 3D environment and is tasked with searching for an unknown number of stationary targets on the 2D floor of the environment.
We model the sensing field using a multi-fidelity Gaussian process that systematically describes the sensing information available at different altitudes from the floor.
Based on the sensing model, we design a novel algorithm called Expedited Multi-Target Search (EMTS) that addresses the coverage-accuracy trade-off.
arXiv Detail & Related papers (2020-01-17T15:55:21Z)
- Latency-Aware Differentiable Neural Architecture Search [113.35689580508343]
Differentiable neural architecture search methods became popular in recent years, mainly due to their low search costs and flexibility in designing the search space.
However, these methods have difficulty optimizing the network for hardware, so the searched network is often hardware-unfriendly.
This paper addresses this problem by adding a differentiable latency loss term to the optimization, so that the search process can trade off accuracy against latency via a balancing coefficient.
arXiv Detail & Related papers (2020-01-17T15:55:21Z)
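The balancing idea in the latency-aware search entry above can be sketched as a weighted sum of the task loss and a differentiable latency estimate. This is a minimal illustration under assumed names; the softmax-weighted expected latency mirrors the DARTS-style continuous relaxation, and the coefficient value is arbitrary, not the paper's.

```python
import math

def expected_latency(arch_weights, op_latencies):
    """Softmax-weighted expected latency over candidate ops
    (DARTS-style continuous relaxation; names are illustrative)."""
    exps = [math.exp(w) for w in arch_weights]
    z = sum(exps)
    return sum(e / z * t for e, t in zip(exps, op_latencies))

def latency_aware_loss(task_loss, arch_weights, op_latencies, lam=0.1):
    """Trade off task loss against latency; lam is the balancing coefficient."""
    return task_loss + lam * expected_latency(arch_weights, op_latencies)
```

Because the softmax weights are differentiable in the architecture parameters, the latency term can be minimized jointly with the task loss by gradient descent, which is what lets the search steer toward hardware-friendly networks.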
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.