No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention
and Zoom-in Boundary Detection
- URL: http://arxiv.org/abs/2307.10567v1
- Date: Thu, 20 Jul 2023 04:12:10 GMT
- Title: No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention
and Zoom-in Boundary Detection
- Authors: Qi Zhang and Sipeng Zheng and Qin Jin
- Abstract summary: Temporal video grounding aims to retrieve the time interval of a language query from an untrimmed video.
A significant challenge in TVG is the low "Semantic Noise Ratio (SNR)", which results in worse performance with lower SNR.
We propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection.
- Score: 52.03562682785128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal video grounding (TVG) aims to retrieve the time interval of a
language query from an untrimmed video. A significant challenge in TVG is the
low "Semantic Noise Ratio (SNR)": the lower the SNR, the worse the grounding
performance. Prior works have addressed this challenge with sophisticated techniques.
In this paper, we propose a no-frills TVG model that consists of two core
modules, namely multi-scale neighboring attention and zoom-in boundary
detection. The multi-scale neighboring attention restricts each video token to
aggregate visual context only from its neighbors, extracting the most
distinguishing information through multi-scale feature hierarchies despite
heavy noise. The zoom-in boundary detection then performs fine-grained, local
discrimination of the selected top candidates to adjust the grounding
boundaries. With an end-to-end training strategy, our model achieves
competitive performance on different TVG benchmarks, while its lightweight
architecture also yields faster inference and fewer parameters.
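To make the two modules concrete, the following is a minimal sketch based only on the description above, not the authors' released code; the dot-product scoring, the banded neighbor mask, the window sizes, and the boundary-snapping rule are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def neighboring_attention(x: torch.Tensor, window: int) -> torch.Tensor:
    """Self-attention over video tokens x of shape (T, D) in which each
    token may attend only to tokens within `window` positions of itself."""
    T, D = x.shape
    scores = (x @ x.t()) / (D ** 0.5)                  # (T, T) pairwise scores
    pos = torch.arange(T)
    band = (pos[None, :] - pos[:, None]).abs() <= window
    scores = scores.masked_fill(~band, float("-inf"))  # drop non-neighbor tokens
    return F.softmax(scores, dim=-1) @ x               # neighbor-only context

def multi_scale_neighboring_attention(x, windows=(1, 2, 4, 8)):
    """Build a multi-scale feature hierarchy by running neighboring
    attention at several window sizes (the sizes here are assumptions)."""
    return [neighboring_attention(x, w) for w in windows]

def zoom_in_boundaries(frame_feats, query_feat, candidates, radius=4):
    """Refine coarse (start, end) candidates by rescoring only the frames
    near each boundary and snapping to the best-matching frame."""
    sims = frame_feats @ query_feat                    # (T,) frame-query similarity
    T = frame_feats.shape[0]
    refined = []
    for s, e in candidates:
        lo, hi = max(0, s - radius), min(T, s + radius + 1)
        new_s = lo + int(sims[lo:hi].argmax())
        lo, hi = max(0, e - radius), min(T, e + radius + 1)
        new_e = lo + int(sims[lo:hi].argmax())
        refined.append((new_s, new_e) if new_s < new_e else (s, e))
    return refined
```

The banded mask is where the noise robustness comes from: each token's context shrinks from the full T x T interaction to a narrow band around it, so distant, likely irrelevant tokens never enter the aggregation.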
Related papers
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target moment that semantically matches a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from fully decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Generation-Guided Multi-Level Unified Network for Video Grounding [18.402093379973085]
Video grounding aims to locate the timestamps best matching the query description within an untrimmed video.
Moment-level approaches directly predict, from a global perspective, the probability of each transient moment being a boundary.
Clip-level approaches aggregate moments within different time windows into proposals and then select the one most similar to the query, giving them an advantage in fine-grained grounding.
arXiv Detail & Related papers (2023-03-14T09:48:59Z)
- Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition [25.888314212797436]
We propose a novel video frame sampler for few-shot action recognition.
Task-specific spatial-temporal frame sampling is achieved via a temporal selector (TS) and a spatial amplifier (SA).
Experiments show a significant boost on various benchmarks including long-term videos.
arXiv Detail & Related papers (2022-07-20T09:04:12Z)
- Spatial-Temporal Frequency Forgery Clue for Video Forgery Detection in VIS and NIR Scenario [87.72258480670627]
Existing frequency-domain face forgery detection methods find that GAN-forged images show obvious grid-like visual artifacts in the frequency spectrum compared to real images.
This paper proposes a Discrete Cosine Transform-based Forgery Clue Augmentation Network (FCAN-DCT) to achieve a more comprehensive spatial-temporal feature representation.
arXiv Detail & Related papers (2022-07-05T09:27:53Z)
- Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel Transformer [29.03463312813923]
Video denoising aims to recover high-quality frames from noisy video.
Most existing approaches adopt convolutional neural networks (CNNs) to separate the noise from the original visual content.
We propose a Dual-stage Spatial-Channel Transformer (DSCT) for coarse-to-fine video denoising.
arXiv Detail & Related papers (2022-04-30T09:01:21Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Robust Unsupervised Multi-Object Tracking in Noisy Environments [5.409476600348953]
We introduce a robust unsupervised multi-object tracking (MOT) model: AttU-Net.
The proposed single-head attention model helps limit the negative impact of noise by learning visual representations at different segment scales.
We evaluate our method on the MNIST and Atari game video benchmarks.
arXiv Detail & Related papers (2021-05-20T19:38:03Z)
- Robust Unsupervised Video Anomaly Detection by Multi-Path Frame Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method based on carefully designed multi-path frame prediction.
Our proposed method obtains the frame-level AUROC score of 88.3% on the CUHK Avenue dataset.
arXiv Detail & Related papers (2020-11-05T11:34:12Z)
- Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution [95.26202278535543]
A simple solution is to split the task into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR).
However, temporal interpolation and spatial super-resolution are intra-related in this task.
We propose a one-stage space-time video super-resolution framework, which directly synthesizes a high-resolution (HR) slow-motion video from a low-frame-rate (LFR), low-resolution (LR) video.
arXiv Detail & Related papers (2020-02-26T16:59:48Z)
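As a concrete reading of this last entry, a minimal sketch with assumed model interfaces (not the paper's actual API) contrasting the two-stage baseline with the one-stage formulation:

```python
def two_stage(lfr_lr_clip, vfi_model, vsr_model):
    # Baseline: temporal interpolation (VFI) first, then spatial SR (VSR).
    hfr_lr_clip = vfi_model(lfr_lr_clip)
    return vsr_model(hfr_lr_clip)

def one_stage(lfr_lr_clip, space_time_sr_model):
    # One-stage: a single model exploits that temporal interpolation and
    # spatial super-resolution are intra-related, mapping the LFR, LR clip
    # directly to an HR slow-motion clip.
    return space_time_sr_model(lfr_lr_clip)
```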