STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic
Cross-Modal Understanding
- URL: http://arxiv.org/abs/2207.02756v1
- Date: Wed, 6 Jul 2022 15:48:58 GMT
- Title: STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic
Cross-Modal Understanding
- Authors: Zihang Lin, Chaolei Tan, Jian-Fang Hu, Zhi Jin, Tiancai Ye, Wei-Shi
Zheng
- Abstract summary: We propose a framework named STVGFormer, which models visual-linguistic dependencies with a static branch and a dynamic branch.
Both the static and dynamic branches are designed as cross-modal transformers.
Our proposed method achieved 39.6% vIoU and won first place in the HC-STVG track of the Person in Context Challenge.
- Score: 68.96574451918458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this technical report, we introduce our solution to human-centric
spatio-temporal video grounding task. We propose a concise and effective
framework named STVGFormer, which models spatiotemporal visual-linguistic
dependencies with a static branch and a dynamic branch. The static branch
performs cross-modal understanding in a single frame and learns to localize the
target object spatially according to intra-frame visual cues like object
appearances. The dynamic branch performs cross-modal understanding across
multiple frames. It learns to predict the starting and ending time of the
target moment according to dynamic visual cues like motions. Both the static
and dynamic branches are designed as cross-modal transformers. We further
design a novel static-dynamic interaction block to enable the static and
dynamic branches to transfer useful and complementary information from each
other, which is shown to be effective in improving predictions on hard cases.
Our proposed method achieved 39.6% vIoU and won first place in the HC-STVG
track of the 4th Person in Context Challenge.
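To make the two-branch design more concrete, below is a minimal PyTorch-style sketch of the architecture described in the abstract. All module names, feature dimensions, the mean-pooling choices, and the specific form of the static-dynamic interaction are assumptions made for illustration; they are not taken from the authors' released implementation.
```python
# Minimal sketch of a static/dynamic two-branch cross-modal design.
# Module names, dimensions, and the interaction scheme are illustrative
# assumptions, not the official STVGFormer code.
import torch
import torch.nn as nn


class CrossModalBranch(nn.Module):
    """Generic cross-modal transformer: visual tokens attend to text tokens."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=layers)

    def forward(self, visual_tokens, text_tokens):
        # Visual tokens act as queries; text tokens serve as keys/values.
        return self.decoder(visual_tokens, text_tokens)


class STVGFormerSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.static_branch = CrossModalBranch(dim)   # per-frame spatial grounding
        self.dynamic_branch = CrossModalBranch(dim)  # cross-frame temporal grounding
        # Assumed interaction: each branch conditions on a summary of the other.
        self.static_to_dynamic = nn.Linear(dim, dim)
        self.dynamic_to_static = nn.Linear(dim, dim)
        self.box_head = nn.Linear(dim, 4)            # per-frame box (cx, cy, w, h)
        self.time_head = nn.Linear(dim, 2)           # per-frame start/end logits

    def forward(self, frame_tokens, text_tokens):
        # frame_tokens: (B, T, N, D) visual patch tokens per frame
        # text_tokens:  (B, L, D) language features
        B, T, N, D = frame_tokens.shape

        # Static branch: cross-modal understanding within each frame.
        static_in = frame_tokens.reshape(B * T, N, D)
        static_out = self.static_branch(
            static_in, text_tokens.repeat_interleave(T, dim=0)
        ).reshape(B, T, N, D)

        # Dynamic branch: cross-modal understanding across frames.
        dynamic_in = static_out.mean(dim=2)          # one token per frame (B, T, D)
        dynamic_out = self.dynamic_branch(dynamic_in, text_tokens)

        # Static-dynamic interaction: exchange complementary information.
        static_fused = static_out + self.dynamic_to_static(dynamic_out).unsqueeze(2)
        dynamic_fused = dynamic_out + self.static_to_dynamic(static_out.mean(dim=2))

        boxes = self.box_head(static_fused.mean(dim=2))  # (B, T, 4) spatial localization
        times = self.time_head(dynamic_fused)            # (B, T, 2) temporal boundaries
        return boxes, times


if __name__ == "__main__":
    model = STVGFormerSketch()
    frames = torch.randn(1, 8, 49, 256)  # 8 frames, 7x7 patch tokens each
    text = torch.randn(1, 12, 256)       # 12 word tokens
    boxes, times = model(frames, text)
    print(boxes.shape, times.shape)      # (1, 8, 4) and (1, 8, 2)
```
In this sketch, per-frame boxes come from the static branch and start/end logits from the dynamic branch, mirroring the division of labor described in the abstract; the actual query design, interaction block, and training losses of STVGFormer may differ.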
Related papers
- DualAD: Disentangling the Dynamic and Static World for End-to-End Driving [11.379456277711379]
State-of-the-art approaches for autonomous driving integrate multiple sub-tasks of the overall driving task into a single pipeline.
We propose dedicated representations to disentangle dynamic agents and static scene elements.
Our method, DualAD, outperforms independently trained single-task networks.
arXiv Detail & Related papers (2024-06-10T13:46:07Z)
- Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation [126.12940972028012]
We present HVC, a framework for self-supervised video object segmentation.
HVC extracts pseudo-dynamic signals from static images, enabling an efficient and scalable VOS model.
We propose a hybrid visual correspondence loss to learn joint static and dynamic consistency representations.
arXiv Detail & Related papers (2024-04-21T02:21:30Z)
- Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding [56.315932539150324]
We design a Unified Static and Dynamic Network (UniSDNet) to learn the semantic association between the video and text/audio queries.
Our UniSDNet is applicable to both Natural Language Video Grounding (NLVG) and Spoken Language Video Grounding (SLVG) tasks.
arXiv Detail & Related papers (2024-03-21T06:53:40Z)
- Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network based on a Semantics Consistent Transformer (SCTNet), with both spatial and channel attention modules.
arXiv Detail & Related papers (2023-05-29T15:03:23Z)
- Efficient 3D Reconstruction, Streaming and Visualization of Static and Dynamic Scene Parts for Multi-client Live-telepresence in Large-scale Environments [6.543101569579952]
We aim at sharing 3D live-telepresence experiences in large-scale environments beyond room scale with both static and dynamic scene entities.
Our system is able to achieve VR-based live-telepresence at close to real-time rates.
arXiv Detail & Related papers (2022-11-25T18:59:54Z)
- Dynamic View Synthesis from Dynamic Monocular Video [69.80425724448344]
We present an algorithm for generating views at arbitrary viewpoints and any input time step given a monocular video of a dynamic scene.
We show extensive quantitative and qualitative results of dynamic view synthesis from casually captured videos.
arXiv Detail & Related papers (2021-05-13T17:59:50Z)
- DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose DS-Net, a novel dynamic spatiotemporal network for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance compared to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)
- Empty Cities: a Dynamic-Object-Invariant Space for Visual SLAM [6.693607456009373]
We present a data-driven approach to obtain the static image of a scene, eliminating dynamic objects that might have been present at the time of traversing the scene with a camera.
We introduce an end-to-end deep learning framework to turn images of an urban environment into realistic static frames suitable for localization and mapping.
arXiv Detail & Related papers (2020-10-15T10:31:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.