CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video
Temporal Grounding
- URL: http://arxiv.org/abs/2209.10918v2
- Date: Tue, 30 May 2023 02:03:34 GMT
- Title: CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video
Temporal Grounding
- Authors: Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong
Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan
- Abstract summary: This paper tackles the emerging and challenging problem of long video temporal grounding (VTG).
Compared with short videos, long videos are in high demand but less explored.
We propose CONE, an efficient COarse-to-fiNE alignment framework.
- Score: 70.7882058229772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper tackles an emerging and challenging problem of long video
temporal grounding (VTG) that localizes video moments related to a natural
language (NL) query. Compared with short videos, long videos are also in high
demand but less explored, which brings new challenges of higher inference
computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE,
an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play
framework on top of existing VTG models to handle long videos through a sliding
window mechanism. Specifically, CONE (1) introduces a query-guided window
selection strategy to speed up inference, and (2) proposes a coarse-to-fine
mechanism via a novel incorporation of contrastive learning to enhance
multi-modal alignment for long videos. Extensive experiments on two large-scale
long VTG benchmarks consistently show both substantial performance gains (e.g.,
from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal
higher efficiency as the query-guided window selection mechanism accelerates
inference time by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results.
Codes have been released at https://github.com/houzhijian/CONE.
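The coarse-to-fine idea described in the abstract can be sketched as follows. This is a minimal illustration of query-guided window selection followed by within-window localization, not CONE's actual architecture (which uses trained multi-modal encoders and contrastive alignment); the cosine-similarity scoring, window sizes, and function names here are assumptions of mine.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def coarse_to_fine_grounding(frame_feats, query_feat, window_size=64, stride=32, top_k=3):
    """Coarse stage: score sliding windows against the query, keep the top-k.
    Fine stage: pick the best-matching frame inside the kept windows.
    frame_feats: (n_frames, dim) array; query_feat: (dim,) array."""
    n = len(frame_feats)
    starts = range(0, max(n - window_size, 0) + 1, stride)
    windows = [(s, min(s + window_size, n)) for s in starts]
    # Coarse: rank windows by similarity of their mean-pooled feature to the
    # query, so the expensive fine stage only runs on a few candidate windows.
    ranked = sorted(
        windows,
        key=lambda w: cosine(frame_feats[w[0]:w[1]].mean(axis=0), query_feat),
        reverse=True,
    )
    best = None  # (similarity, frame index)
    for s, e in ranked[:top_k]:
        sims = [cosine(f, query_feat) for f in frame_feats[s:e]]
        i = int(np.argmax(sims))
        if best is None or sims[i] > best[0]:
            best = (sims[i], s + i)
    return best
```

Scoring frames only inside the top-k windows is what the abstract's inference speed-up (2x on Ego4D-NLQ, 15x on MAD) corresponds to: the fine-grained pass skips most of the long video.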
Related papers
- Encoding and Controlling Global Semantics for Long-form Video Question Answering [40.129800076300434]
We introduce a state space layer (SSL) into multi-modal Transformer to efficiently integrate global semantics of the video.
Our SSL includes a gating unit to enable controllability over the flow of global semantics into visual representations.
To rigorously evaluate long-form videoQA capacity, we construct two new benchmarks Ego-QA and MAD-QA featuring videos of considerably long length.
arXiv Detail & Related papers (2024-05-30T06:10:10Z)
- SnAG: Scalable and Accurate Video Grounding [10.578025234151596]
Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding.
We study the effect of cross-modal fusion on the scalability of video grounding models.
We present SnAG, a simple baseline for scalable and accurate video grounding.
arXiv Detail & Related papers (2024-04-02T19:25:04Z)
- Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task - Temporal Sentence Grounding in Streaming Videos (TSGSV).
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
arXiv Detail & Related papers (2023-08-14T12:30:58Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video anomaly retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- DVIS: Decoupled Video Instance Segmentation Framework [15.571072365208872]
Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing.
Existing methods often underperform on complex and long videos in the real world, primarily due to two factors.
We propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement.
arXiv Detail & Related papers (2023-06-06T05:24:15Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification, with even faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve key frames, combining a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on the top of the CNN to further elevate the performance of classification.
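The density-peaks rule that TSDPC builds on can be sketched as follows. This is the classic density-peaks clustering criterion, not the paper's temporal-segment variant: the cutoff `d_c` and the `rho * delta` decision score are illustrative choices of mine, and unlike TSDPC this sketch returns a ranking rather than choosing the number of key frames automatically.

```python
import numpy as np

def density_peak_ranking(feats, d_c=1.0):
    """Rank frames as key-frame candidates by the density-peaks rule:
    a key frame has high local density (rho) AND lies far from any frame
    of higher density (delta). feats: (n_frames, dim) array."""
    dist = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    rho = (dist < d_c).sum(axis=1) - 1         # neighbors within the cutoff (excluding self)
    n = len(feats)
    delta = np.empty(n)
    for i in range(n):
        higher = np.flatnonzero(rho > rho[i])  # frames denser than frame i
        delta[i] = dist[i, higher].min() if higher.size else dist[i].max()
    gamma = rho * delta                        # decision score: dense AND isolated
    return np.argsort(-gamma, kind="stable")   # best key-frame candidates first
```

With two well-separated clusters of frame features, each cluster contributes one high-gamma frame, so the top of the ranking covers distinct temporal segments instead of near-duplicate frames.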
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- Enhanced Spatio-Temporal Interaction Learning for Video Deraining: A Faster and Better Framework [93.37833982180538]
Video deraining is an important task in computer vision as the unwanted rain hampers the visibility of videos and deteriorates the robustness of most outdoor vision systems.
We present a new end-to-end deraining framework, named Enhanced Spatio-Temporal Interaction Network (ESTINet).
ESTINet considerably boosts current state-of-the-art video deraining quality and speed.
arXiv Detail & Related papers (2021-03-23T05:19:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.