You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos
- URL: http://arxiv.org/abs/2303.07863v2
- Date: Thu, 16 Mar 2023 08:34:27 GMT
- Title: You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos
- Authors: Xiang Fang, Daizong Liu, Pan Zhou, Guoshun Nan
- Abstract summary: Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
- Score: 56.676761067861236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a
target moment semantically according to a sentence query. Although previous
works have achieved decent success, they focus only on high-level visual
features extracted from consecutive decoded frames and fail to handle
compressed videos for query modelling, suffering from insufficient
representation capability and significant computational complexity during
training and testing. In this paper, we propose a new setting, compressed-domain
TSG, which directly utilizes compressed videos rather than fully-decompressed
frames as the visual input. To handle the raw video bit-stream input, we
propose a novel Three-branch Compressed-domain Spatial-temporal Fusion (TCSF)
framework, which extracts and aggregates three kinds of low-level visual
features (I-frame, motion vector and residual features) for effective and
efficient grounding. In particular, instead of encoding all the decoded frames
as previous works do, we capture the appearance representation by learning only
the I-frame feature to reduce delay and latency. Besides, we explore the motion
information not only by learning the motion vector feature, but also by
exploring the relations of neighboring frames via the residual feature. In this
way, a three-branch spatial-temporal attention layer with an adaptive
motion-appearance fusion module is further designed to extract and aggregate
both appearance and motion information for the final grounding. Experiments on
three challenging datasets show that our TCSF achieves better performance than
other state-of-the-art methods with lower complexity.
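To make the fusion step concrete, below is a minimal PyTorch sketch of a three-branch spatial-temporal fusion block in the spirit of TCSF: one temporal attention branch per compressed-domain signal (I-frame, motion vector, residual) and a learned gate for adaptive motion-appearance fusion. The module names, feature dimensions, and gating formulation are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a three-branch compressed-domain fusion block, assuming
# pre-extracted I-frame, motion-vector and residual features of equal dimension.
# All names, sizes and the gating design are illustrative, not the TCSF code.
import torch
import torch.nn as nn


class ThreeBranchFusion(nn.Module):
    """Fuse I-frame (appearance), motion-vector and residual (motion) features."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # One temporal self-attention branch per compressed-domain modality.
        self.branches = nn.ModuleDict({
            name: nn.MultiheadAttention(dim, heads, batch_first=True)
            for name in ("iframe", "mv", "residual")
        })
        # Adaptive motion-appearance fusion: a learned gate weighs the appearance
        # stream against the two motion streams at every time step.
        self.gate = nn.Sequential(nn.Linear(3 * dim, 3), nn.Softmax(dim=-1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, iframe, mv, residual):
        # Each input: (batch, time, dim) features taken from the bit-stream.
        feats = {"iframe": iframe, "mv": mv, "residual": residual}
        attended = {k: self.branches[k](v, v, v)[0] for k, v in feats.items()}
        stacked = torch.stack(
            [attended["iframe"], attended["mv"], attended["residual"]], dim=-2
        )  # (batch, time, 3, dim)
        weights = self.gate(stacked.flatten(-2))           # (batch, time, 3)
        fused = (weights.unsqueeze(-1) * stacked).sum(-2)  # adaptive weighted sum
        return self.proj(fused)                            # (batch, time, dim)


if __name__ == "__main__":
    clip = [torch.randn(2, 16, 256) for _ in range(3)]  # toy clip features
    print(ThreeBranchFusion()(*clip).shape)              # torch.Size([2, 16, 256])
```

The fused sequence would then feed the grounding head that matches video moments against the sentence query; that part is omitted here.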
Related papers
- Improved Video VAE for Latent Video Diffusion Model [55.818110540710215]
Variational Autoencoder (VAE) aims to compress pixel data into a low-dimensional latent space, playing an important role in OpenAI's Sora.
Most existing VAEs inflate a pretrained image VAE into a 3D causal structure for temporal-spatial compression.
We propose a new KTC architecture and a group causal convolution (GCConv) module to further improve the video VAE (IV-VAE).
arXiv Detail & Related papers (2024-11-10T12:43:38Z) - SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition [18.542942459854867]
Traditional data-driven research continuously requires large amounts of video samples.
In this paper, we propose a novel plug-and-play architecture for few-shot action recognition called Spatio-tempOral frAme tuPle enhancer (SOAP).
SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51.
arXiv Detail & Related papers (2024-07-23T09:45:25Z) - D-NPC: Dynamic Neural Point Clouds for Non-Rigid View Synthesis from Monocular Video [53.83936023443193]
This paper introduces a new method for dynamic novel view synthesis from monocular video, such as smartphone captures.
Our approach represents the scene as a dynamic neural point cloud, an implicit time-conditioned point cloud that encodes local geometry and appearance in separate hash-encoded neural feature grids.
arXiv Detail & Related papers (2024-06-14T14:35:44Z) - Accelerated Event-Based Feature Detection and Compression for
Surveillance Video Systems [1.5390526524075634]
We propose a novel system which conveys temporal redundancy within a sparse decompressed representation.
We leverage a video representation framework called ADDER to transcode framed videos to sparse, asynchronous intensity samples.
Our work paves the way for upcoming neuromorphic sensors and is amenable to future applications with spiking neural networks.
arXiv Detail & Related papers (2023-12-13T15:30:29Z) - DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking
Tasks [76.24996889649744]
We study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS).
We propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos.
Our model sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets.
arXiv Detail & Related papers (2023-04-02T16:40:42Z) - Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z) - Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z) - Spatial-Temporal Transformer for Dynamic Scene Graph Generation [34.190733855032065]
We propose a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input.
Our method is validated on the benchmark dataset Action Genome (AG). A generic sketch of this spatial-encoder/temporal-decoder pattern appears after this list.
arXiv Detail & Related papers (2021-07-26T16:30:30Z)
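As a companion to the Spatial-Temporal Transformer entry above, here is a generic, illustrative sketch of the spatial-encoder/temporal-decoder pattern it describes: a transformer encoder relates object tokens within each frame, and a transformer decoder then reasons over the encoder's per-frame outputs across time. The layer sizes, pooling, and decoder wiring are assumptions for illustration only, not the paper's reference code.

```python
# Illustrative sketch (assumed layer sizes and wiring) of a spatial encoder over
# within-frame object tokens followed by a temporal decoder that consumes the
# encoder's output, as described in the entry above.
import torch
import torch.nn as nn


class SpatialTemporalSketch(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Spatial encoder: self-attention among object tokens inside one frame.
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2,
        )
        # Temporal decoder: attends over the per-frame encoder outputs.
        self.temporal = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2,
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (frames, objects, dim) object features for one video clip.
        per_frame = self.spatial(tokens)           # visual relations within frames
        seq = per_frame.mean(dim=1).unsqueeze(0)   # (1, frames, dim) frame sequence
        out = self.temporal(tgt=seq, memory=seq)   # reason across frames
        return out.squeeze(0)                      # (frames, dim)


if __name__ == "__main__":
    clip = torch.randn(8, 12, 256)                 # 8 frames, 12 object tokens each
    print(SpatialTemporalSketch()(clip).shape)     # torch.Size([8, 256])
```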