STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic
Cross-Modal Understanding
- URL: http://arxiv.org/abs/2207.02756v1
- Date: Wed, 6 Jul 2022 15:48:58 GMT
- Title: STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic
Cross-Modal Understanding
- Authors: Zihang Lin, Chaolei Tan, Jian-Fang Hu, Zhi Jin, Tiancai Ye, Wei-Shi
Zheng
- Abstract summary: We propose a framework named STVGFormer, which models visual-linguistic dependencies with a static branch and a dynamic branch.
Both the static and dynamic branches are designed as cross-modal transformers.
Our proposed method achieved 39.6% vIoU and won first place in the HC-STVG track of the 4th Person in Context Challenge.
- Score: 68.96574451918458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this technical report, we introduce our solution to human-centric
spatio-temporal video grounding task. We propose a concise and effective
framework named STVGFormer, which models spatiotemporal visual-linguistic
dependencies with a static branch and a dynamic branch. The static branch
performs cross-modal understanding in a single frame and learns to localize the
target object spatially according to intra-frame visual cues like object
appearances. The dynamic branch performs cross-modal understanding across
multiple frames. It learns to predict the starting and ending time of the
target moment according to dynamic visual cues like motions. Both the static
and dynamic branches are designed as cross-modal transformers. We further
design a novel static-dynamic interaction block to enable the static and
dynamic branches to transfer useful and complementary information from each
other, which is shown to be effective in improving predictions on hard cases.
Our proposed method achieved 39.6% vIoU and won first place in the HC-STVG
track of the 4th Person in Context Challenge.
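The abstract describes the architecture only at a high level. As a rough illustration, the following PyTorch-style sketch mirrors the described two-branch design: a static cross-modal branch over single-frame tokens, a dynamic cross-modal branch over clip-level tokens, and an interaction block that lets the branches exchange information. All module names, dimensions, head counts, and the exact interaction mechanism are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the two-branch design described in the abstract.
# All module names, dimensions, and the interaction mechanism are
# assumptions; this is not the authors' code.
import torch
import torch.nn as nn


class CrossModalBranch(nn.Module):
    """Cross-modal transformer: visual tokens attend to language tokens."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(layers)])

    def forward(self, vis, txt):
        # vis: (B, Nv, D) visual tokens, txt: (B, Nt, D) language tokens
        for attn, norm in zip(self.attn_layers, self.norms):
            out, _ = attn(query=vis, key=txt, value=txt)
            vis = norm(vis + out)
        return vis


class InteractionBlock(nn.Module):
    """Hypothetical static-dynamic interaction: each branch attends to the other."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.static_from_dynamic = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dynamic_from_static = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, static_feat, dynamic_feat):
        s_out, _ = self.static_from_dynamic(static_feat, dynamic_feat, dynamic_feat)
        d_out, _ = self.dynamic_from_static(dynamic_feat, static_feat, static_feat)
        return static_feat + s_out, dynamic_feat + d_out


class STVGFormerSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.static_branch = CrossModalBranch(dim)   # per-frame spatial grounding
        self.dynamic_branch = CrossModalBranch(dim)  # cross-frame temporal grounding
        self.interaction = InteractionBlock(dim)
        self.box_head = nn.Linear(dim, 4)            # (cx, cy, w, h) for the target
        self.time_head = nn.Linear(dim, 2)           # normalized (start, end) of the moment

    def forward(self, frame_tokens, clip_tokens, text_tokens):
        # frame_tokens: (B, Nv, D) tokens of a single frame
        # clip_tokens:  (B, T, D)  one token per sampled frame
        # text_tokens:  (B, Nt, D) encoded textual description
        s = self.static_branch(frame_tokens, text_tokens)
        d = self.dynamic_branch(clip_tokens, text_tokens)
        s, d = self.interaction(s, d)                # exchange complementary cues
        boxes = self.box_head(s.mean(dim=1)).sigmoid()
        times = self.time_head(d.mean(dim=1)).sigmoid()
        return boxes, times
```

In this sketch the static branch pools to a per-frame box and the dynamic branch pools to normalized start/end times; the actual STVGFormer prediction heads and training losses are not specified in the abstract.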
Related papers
- Static-Dynamic Class-level Perception Consistency in Video Semantic Segmentation [9.964615076037397]
Video semantic segmentation (VSS) has been widely employed in many fields, such as simultaneous localization and mapping.
Previous efforts have primarily focused on pixel-level static-dynamic contexts matching.
This paper rethinks static-dynamic contexts at the class level and proposes a novel static-dynamic class-level perceptual consistency framework.
arXiv Detail & Related papers (2024-12-11T02:29:51Z) - Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos [101.48581851337703]
We present BTimer, the first motion-aware feed-forward model for real-time reconstruction and novel view synthesis of dynamic scenes.
Our approach reconstructs the full scene in a 3D Gaussian Splatting representation at a given target ('bullet') timestamp by aggregating information from all the context frames.
Given a casual monocular dynamic video, BTimer reconstructs a bullet-time scene within 150ms while reaching state-of-the-art performance on both static and dynamic scene datasets.
arXiv Detail & Related papers (2024-12-04T18:15:06Z) - DualAD: Disentangling the Dynamic and Static World for End-to-End Driving [11.379456277711379]
State-of-the-art approaches for autonomous driving integrate multiple sub-tasks of the overall driving task into a single pipeline.
We propose dedicated representations to disentangle dynamic agents and static scene elements.
Our method titled DualAD outperforms independently trained single-task networks.
arXiv Detail & Related papers (2024-06-10T13:46:07Z) - Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation [126.12940972028012]
We present HVC, a framework for self-supervised video object segmentation.
HVC extracts pseudo-dynamic signals from static images, enabling an efficient and scalable VOS model.
We propose a hybrid visual correspondence loss to learn joint static and dynamic consistency representations.
arXiv Detail & Related papers (2024-04-21T02:21:30Z) - Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network with a Semantics Consistent Transformer (SCTNet) with both spatial and channel attention modules.
arXiv Detail & Related papers (2023-05-29T15:03:23Z) - Efficient 3D Reconstruction, Streaming and Visualization of Static and
Dynamic Scene Parts for Multi-client Live-telepresence in Large-scale
Environments [6.543101569579952]
We aim at sharing 3D live-telepresence experiences in large-scale environments beyond room scale with both static and dynamic scene entities.
Our system is able to achieve VR-based live-telepresence at close to real-time rates.
arXiv Detail & Related papers (2022-11-25T18:59:54Z) - Dynamic View Synthesis from Dynamic Monocular Video [69.80425724448344]
We present an algorithm for generating views at arbitrary viewpoints and any input time step given a monocular video of a dynamic scene.
We show extensive quantitative and qualitative results of dynamic view synthesis from casually captured videos.
arXiv Detail & Related papers (2021-05-13T17:59:50Z) - DS-Net: Dynamic Spatiotemporal Network for Video Salient Object
Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method outperforms state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.