ViGT: Proposal-free Video Grounding with Learnable Token in Transformer
- URL: http://arxiv.org/abs/2308.06009v1
- Date: Fri, 11 Aug 2023 08:30:08 GMT
- Title: ViGT: Proposal-free Video Grounding with Learnable Token in Transformer
- Authors: Kun Li, Dan Guo, Meng Wang
- Abstract summary: The video grounding task aims to locate the queried action or event in an untrimmed video based on rich linguistic descriptions.
Existing proposal-free methods are trapped in the complex interaction between video and query.
We propose a novel boundary regression paradigm that performs regression token learning in a transformer.
- Score: 28.227291816020646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The video grounding (VG) task aims to locate the queried action or event in
an untrimmed video based on rich linguistic descriptions. Existing
proposal-free methods are trapped in the complex interaction between video and
query, overemphasizing cross-modal feature fusion and feature correlation for
VG. In this paper, we propose a novel boundary regression paradigm that
performs regression token learning in a transformer. Particularly, we present a
simple but effective proposal-free framework, namely Video Grounding
Transformer (ViGT), which predicts the temporal boundary using a learnable
regression token rather than multi-modal or cross-modal features. In ViGT, the
benefits of a learnable token are manifested as follows. (1) The token is
unrelated to the video or the query and avoids data bias toward the original
video and query. (2) The token simultaneously performs global context
aggregation from video and query features. First, we employed a shared feature
encoder to project both video and query into a joint feature space before
performing cross-modal co-attention (i.e., video-to-query attention and
query-to-video attention) to highlight discriminative features in each
modality. Furthermore, we concatenated a learnable regression token [REG] with
the video and query features as the input of a vision-language transformer.
Finally, we utilized the token [REG] to predict the target moment and the visual
features to constrain the foreground and background probabilities at each
timestamp. The proposed ViGT performed well on three public datasets: ANet
Captions, TACoS and YouCookII. Extensive ablation studies and qualitative
analysis further validated the interpretability of ViGT.
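The abstract describes a concrete pipeline: a shared encoder projects video and query into a joint space, cross-modal co-attention highlights discriminative features in each modality, a learnable [REG] token is concatenated with both sequences, and the transformer output at [REG] regresses the temporal boundary while the video-token outputs supervise per-timestamp foreground/background probabilities. The PyTorch sketch below is a minimal reading of that flow under assumed dimensions and a simplified dot-product co-attention; class and method names such as `ViGTSketch` and `coattend` are placeholders, not the authors' released code.
```python
# Minimal sketch of the ViGT-style pipeline described in the abstract above.
# Dimensions, the co-attention formulation, and the heads are illustrative
# assumptions, not the authors' released implementation.
import torch
import torch.nn as nn


class ViGTSketch(nn.Module):
    def __init__(self, video_dim=1024, query_dim=300, d_model=256,
                 n_layers=4, n_heads=8):
        super().__init__()
        # Shared feature space: project both modalities to d_model.
        self.video_proj = nn.Linear(video_dim, d_model)
        self.query_proj = nn.Linear(query_dim, d_model)
        # Learnable regression token [REG]; it depends on neither the video
        # nor the query, which is the property the abstract emphasizes.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        # Vision-language transformer over [REG] + video + query tokens.
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Boundary head: [REG] output -> normalized (start, end).
        self.boundary_head = nn.Sequential(nn.Linear(d_model, d_model),
                                           nn.ReLU(),
                                           nn.Linear(d_model, 2),
                                           nn.Sigmoid())
        # Per-timestamp foreground/background probability from video tokens.
        self.fg_head = nn.Linear(d_model, 1)

    @staticmethod
    def coattend(a, b):
        # Simplified dot-product co-attention: every token of `a` attends
        # over `b` and adds the attended context back to itself.
        attn = torch.softmax(a @ b.transpose(1, 2) / a.size(-1) ** 0.5, dim=-1)
        return a + attn @ b

    def forward(self, video_feats, query_feats):
        # video_feats: (B, T, video_dim); query_feats: (B, L, query_dim)
        v = self.video_proj(video_feats)
        q = self.query_proj(query_feats)
        # Video-to-query and query-to-video attention highlight
        # discriminative features in each modality.
        v, q = self.coattend(v, q), self.coattend(q, v)
        # Concatenate the learnable [REG] token with both feature sequences.
        reg = self.reg_token.expand(v.size(0), -1, -1)
        x = self.encoder(torch.cat([reg, v, q], dim=1))
        reg_out = x[:, 0]                      # [REG] output
        v_out = x[:, 1:1 + v.size(1)]          # contextualized video tokens
        boundary = self.boundary_head(reg_out)                    # (B, 2)
        fg_prob = torch.sigmoid(self.fg_head(v_out)).squeeze(-1)  # (B, T)
        return boundary, fg_prob
```
Calling `ViGTSketch()(torch.randn(2, 64, 1024), torch.randn(2, 20, 300))` returns a (2, 2) tensor of normalized start/end predictions and a (2, 64) tensor of per-timestamp foreground probabilities; note that the boundary is read only from the [REG] output rather than from fused multi-modal features, which is the point the abstract stresses.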
Related papers
- Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets [62.280729345770936]
We introduce the task of Alignable Video Retrieval (AVR).
Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query.
Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-02T20:00:49Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (a sketch of such a sampling policy follows this entry).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
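The data-level change in the VaQuitA summary above (frame sampling guided by CLIP-score rankings rather than uniform sampling) can be pictured with a short sketch. The selection policy below and the assumption of precomputed per-frame CLIP similarities are illustrative only; VaQuitA's actual sampling procedure may differ.
```python
import numpy as np


def sample_frames_by_clip_score(clip_scores, num_frames=8):
    """Pick frame indices guided by CLIP text-frame similarity rankings.

    `clip_scores` is assumed to hold precomputed CLIP similarities between
    the text query and each candidate frame; the exact scoring and selection
    policy used by VaQuitA may differ from this sketch.
    """
    clip_scores = np.asarray(clip_scores)
    if len(clip_scores) <= num_frames:
        return np.arange(len(clip_scores))
    # Keep the highest-scoring frames, then restore temporal order so a
    # downstream video model still sees frames chronologically.
    top = np.argpartition(-clip_scores, num_frames)[:num_frames]
    return np.sort(top)


def sample_frames_uniform(total_frames, num_frames=8):
    # The baseline the summary contrasts against: query-agnostic
    # uniform sampling over the whole clip.
    return np.linspace(0, total_frames - 1, num_frames).astype(int)
```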
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Contrastive Video Question Answering via Video Graph Transformer [184.3679515511028]
We propose a Video Graph Transformer model (CoVGT) to perform video question answering (VideoQA) in a contrastive manner.
CoVGT's uniqueness and superiority are three-fold.
We show that CoVGT achieves much better performance than previous methods on video reasoning tasks.
arXiv Detail & Related papers (2023-02-27T11:09:13Z)
- Video Graph Transformer for Video Question Answering [182.14696075946742]
This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering (VideoQA).
We show that, in the pretraining-free scenario, VGT achieves much better performance than prior methods on VideoQA tasks that challenge dynamic relation reasoning.
arXiv Detail & Related papers (2022-07-12T06:51:32Z)
- Modality-Balanced Embedding for Video Retrieval [21.81705847039759]
We identify a modality bias phenomenon in which the video encoder relies almost entirely on text matching.
We propose MBVR (short for Modality Balanced Video Retrieval) with two key components.
We show empirically that our method is both effective and efficient in solving the modality bias problem.
arXiv Detail & Related papers (2022-04-18T06:29:46Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips; a hypothetical record illustrating this scheme is sketched after this entry.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
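To make the three-part annotation in the QVHighlights summary concrete, here is a sketch of what a single record might look like. The field names and values are hypothetical illustrations of the described scheme (free-form query, relevant moments, five-point saliency scores), not the dataset's documented schema.
```python
# Hypothetical illustration of one QVHighlights-style annotation record.
# Field names and values are assumptions made for illustration only,
# not the dataset's documented schema.
example_annotation = {
    "video_id": "some_youtube_clip",                    # source video (placeholder id)
    "query": "A chef explains how to fold dumplings",   # (1) free-form NL query
    "relevant_moments": [[12.0, 34.0], [58.0, 70.0]],   # (2) [start, end] in seconds w.r.t. the query
    "saliency_scores": [3, 4, 2, 4, 1],                 # (3) five-point saliency, one per relevant clip
}
```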