Temporal Sentence Grounding in Streaming Videos
- URL: http://arxiv.org/abs/2308.07102v1
- Date: Mon, 14 Aug 2023 12:30:58 GMT
- Title: Temporal Sentence Grounding in Streaming Videos
- Authors: Tian Gan, Xiao Wang, Yan Sun, Jianlong Wu, Qingpei Guo, and Liqiang
Nie
- Abstract summary: This paper aims to tackle a novel task - Temporal Sentence Grounding in Streaming Videos (TSGSV).
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
- Score: 60.67022943824329
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper aims to tackle a novel task - Temporal Sentence Grounding in
Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance
between a video stream and a given sentence query. Unlike regular videos,
streaming videos are acquired continuously from a particular source and
typically must be processed on the fly in applications such as surveillance
and live-stream analysis. TSGSV is therefore challenging: the model must infer
without access to future frames and must process long histories of frames
effectively, two requirements that earlier methods do not address. To tackle
these challenges, we propose two novel methods: (1) a TwinNet
structure that enables the model to learn about upcoming events; and (2) a
language-guided feature compressor that eliminates redundant visual frames and
reinforces the frames that are relevant to the query. We conduct extensive
experiments using ActivityNet Captions, TACoS, and MAD datasets. The results
demonstrate the superiority of our proposed methods. A systematic ablation
study also confirms their effectiveness.
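The abstract describes the language-guided feature compressor only at a high level. As a rough illustration of the idea (not the paper's actual implementation), the following minimal PyTorch sketch scores buffered frame features against a pooled sentence-query embedding, keeps only the top-k most query-relevant frames, and reweights them by their relevance; all module names, dimensions, and the top-k pruning rule are assumptions made for this example.

    import torch
    import torch.nn as nn


    class LanguageGuidedCompressor(nn.Module):
        """Illustrative sketch only: score streamed frame features against a
        sentence query and keep the top-k most relevant frames as the
        compressed history. Names and design choices are hypothetical."""

        def __init__(self, dim: int = 512, keep_k: int = 16):
            super().__init__()
            self.frame_proj = nn.Linear(dim, dim)
            self.query_proj = nn.Linear(dim, dim)
            self.keep_k = keep_k

        def forward(self, frame_feats: torch.Tensor, query_feat: torch.Tensor):
            # frame_feats: (T, dim) features of buffered historical frames
            # query_feat:  (dim,)   pooled sentence-query embedding
            f = self.frame_proj(frame_feats)               # (T, dim)
            q = self.query_proj(query_feat)                # (dim,)
            scores = (f @ q) / f.shape[-1] ** 0.5          # (T,) query relevance
            k = min(self.keep_k, frame_feats.shape[0])
            keep = scores.topk(k).indices.sort().values    # preserve temporal order
            weights = torch.sigmoid(scores[keep]).unsqueeze(-1)
            # Drop redundant frames, reinforce the query-relevant ones.
            return frame_feats[keep] * weights, keep


    # Toy usage: compress 128 buffered frames down to 16 query-relevant ones.
    compressor = LanguageGuidedCompressor(dim=512, keep_k=16)
    frames = torch.randn(128, 512)
    query = torch.randn(512)
    kept, kept_idx = compressor(frames, query)
    print(kept.shape)  # torch.Size([16, 512])

In a streaming setting, a module of this kind could be applied periodically to the frame buffer so that the history passed to the grounding model stays bounded in length.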
Related papers
- ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models [53.9661582975843]
Video Temporal Grounding aims to ground specific segments within an untrimmed video corresponding to a given natural language query.
Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases.
We present ChatVTG, a novel approach that utilizes Video Dialogue Large Language Models (LLMs) for zero-shot video temporal grounding.
arXiv Detail & Related papers (2024-10-01T08:27:56Z)
- Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New
Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding [35.73830796500975]
We present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT).
To generate the above template under sufficient video perception, an encoder-decoder architecture is proposed for effective global context modeling.
Our method outperforms previous state-of-the-art methods by clear margins on two challenging video benchmarks.
arXiv Detail & Related papers (2022-09-27T11:13:04Z)
- Weak Supervision and Referring Attention for Temporal-Textual Association Learning [35.469984595398905]
We propose a weakly-supervised alternative to learn temporal-textual association (dubbed WSRA).
The weak supervision is simply a video-level textual expression indicating that the video contains relevant frames.
The referring attention is a mechanism we design that acts as a scoring function for temporally grounding the given queries over frames.
We validate WSRA through extensive experiments on temporal grounding by language, demonstrating that it notably outperforms state-of-the-art weakly-supervised methods.
arXiv Detail & Related papers (2020-06-21T09:25:28Z)
- Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences [107.0776836117313]
Given an untrimmed video and a declarative/interrogative sentence, STVG aims to localize the spatio-temporal tube of the queried object.
Existing methods cannot tackle the STVG task due to the ineffective tube pre-generation and the lack of novel object relationship modeling.
We present a Spatio-Temporal Graph Reasoning Network (STGRN) for this task.
arXiv Detail & Related papers (2020-01-19T19:53:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.