CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video
Temporal Grounding
- URL: http://arxiv.org/abs/2209.10918v2
- Date: Tue, 30 May 2023 02:03:34 GMT
- Title: CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video
Temporal Grounding
- Authors: Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong
Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan
- Abstract summary: This paper tackles an emerging and challenging problem of long video temporal grounding (VTG).
Compared with short videos, long videos are also in high demand but far less explored.
We propose CONE, an efficient COarse-to-fiNE alignment framework.
- Score: 70.7882058229772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper tackles an emerging and challenging problem of long video temporal
grounding (VTG), which localizes video moments related to a natural language (NL)
query. Compared with short videos, long videos are also in high demand but far
less explored, which brings new challenges of higher inference computation cost
and weaker multi-modal alignment. To address these challenges, we propose CONE,
an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play
framework on top of existing VTG models to handle long videos through a sliding
window mechanism. Specifically, CONE (1) introduces a query-guided window
selection strategy to speed up inference, and (2) proposes a coarse-to-fine
mechanism via a novel incorporation of contrastive learning to enhance
multi-modal alignment for long videos. Extensive experiments on two large-scale
long VTG benchmarks consistently show both substantial performance gains (e.g.,
from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal
higher efficiency as the query-guided window selection mechanism accelerates
inference time by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results.
Codes have been released at https://github.com/houzhijian/CONE.
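For intuition, the sketch below illustrates the coarse-to-fine idea described in the abstract: a long video is split into sliding windows, the windows are pre-ranked with a cheap query-guided similarity score, and only the top-k windows are passed to a more expensive fine-grained moment predictor. This is a minimal, hypothetical illustration; the function names, window sizes, pooling, and score fusion are assumptions for exposition, not the released CONE implementation (see the repository linked above for the actual code).

```python
# Hypothetical sketch of a coarse-to-fine sliding-window pipeline (not the official CONE code).
# `frame_feats`: (T, D) precomputed frame features; `query_feat`: (D,) text query feature.
# `fine_model` stands in for any fine-grained VTG head that is run only on selected windows.
import numpy as np

def make_windows(num_frames, window_size, stride):
    """Enumerate sliding windows as (start, end) frame indices."""
    starts = range(0, max(num_frames - window_size, 0) + 1, stride)
    return [(s, min(s + window_size, num_frames)) for s in starts]

def coarse_window_scores(frame_feats, query_feat, windows):
    """Cheap query-guided score per window: cosine similarity between the
    mean-pooled window feature and the query feature."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    scores = []
    for s, e in windows:
        w = frame_feats[s:e].mean(axis=0)
        w = w / (np.linalg.norm(w) + 1e-8)
        scores.append(float(w @ q))
    return np.array(scores)

def coarse_to_fine(frame_feats, query_feat, fine_model,
                   window_size=90, stride=45, top_k=5):
    windows = make_windows(len(frame_feats), window_size, stride)
    scores = coarse_window_scores(frame_feats, query_feat, windows)
    keep = np.argsort(-scores)[:top_k]  # query-guided window selection
    candidates = []
    for i in keep:
        s, e = windows[i]
        # Fine stage: run the expensive moment predictor only inside the window,
        # then shift its local (start, end, score) proposals to global time.
        for ls, le, sc in fine_model(frame_feats[s:e], query_feat):
            # One simple (illustrative) way to fuse coarse and fine scores.
            candidates.append((s + ls, s + le, sc + scores[i]))
    return sorted(candidates, key=lambda x: -x[2])
```

Because the fine stage only processes the top-k selected windows, its cost scales with k rather than with the full video length, which is the intuition behind the inference speedups reported above.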
Related papers
- Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers [22.349130691342687]
Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment.
We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation.
arXiv Detail & Related papers (2025-06-05T14:41:38Z) - TextVidBench: A Benchmark for Long Video Scene Text Understanding [60.94150574231576]
We introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes).
TextVidBench makes three key contributions: it spans 9 categories (e.g., news, sports, gaming) with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding.
We propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on
arXiv Detail & Related papers (2025-06-05T12:54:56Z) - DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos.
These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z) - VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z) - LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z) - SnAG: Scalable and Accurate Video Grounding [10.578025234151596]
Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding.
We study the effect of cross-modal fusion on the scalability of video grounding models.
We present SnAG, a simple baseline for scalable and accurate video grounding.
arXiv Detail & Related papers (2024-04-02T19:25:04Z) - Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task - Temporal Sentence Grounding in Streaming Videos (TSGSV).
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
arXiv Detail & Related papers (2023-08-14T12:30:58Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - DVIS: Decoupled Video Instance Segmentation Framework [15.571072365208872]
Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing.
Existing methods often underperform on complex and long videos in the real world, primarily due to two factors.
We propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement.
arXiv Detail & Related papers (2023-06-06T05:24:15Z) - Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification while also offering faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z) - Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve the key frames, which combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works: one is that it can calculate the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z)