End-to-End Dense Video Grounding via Parallel Regression
- URL: http://arxiv.org/abs/2109.11265v5
- Date: Wed, 28 Feb 2024 13:04:24 GMT
- Title: End-to-End Dense Video Grounding via Parallel Regression
- Authors: Fengyuan Shi, Weilin Huang, Limin Wang
- Abstract summary: Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query.
We present an end-to-end parallel decoding paradigm by re-purposing a Transformer-like architecture (PRVG).
Thanks to its simple design, our PRVG framework can be applied in different testing schemes.
- Score: 30.984657885692553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video grounding aims to localize the corresponding video moment in an
untrimmed video given a language query. Existing methods often address this
task in an indirect way, by casting it as a proposal-and-match or
fusion-and-detection problem. Solving these surrogate problems often requires
sophisticated label assignment during training and hand-crafted removal of
near-duplicate results. Meanwhile, existing works typically focus on sparse
video grounding with a single sentence as input, which can result in
ambiguous localization because a single sentence often describes the target
moment only vaguely. In this paper, we tackle the new problem of dense video
grounding: simultaneously localizing multiple moments given a paragraph as
input. Viewing video grounding as language-conditioned regression, we present
an end-to-end parallel decoding paradigm by re-purposing a Transformer-like
architecture (PRVG). The key design in PRVG is to use language as queries and
to directly regress the moment boundaries from language-modulated visual
representations. Thanks to its simple design, our PRVG framework can be
applied in different testing schemes (sparse or dense grounding) and allows
efficient inference without any post-processing. In addition, we devise a
robust proposal-level attention loss to guide the training of PRVG, which is
invariant to moment duration and contributes to model convergence. We perform
experiments on two video grounding benchmarks, ActivityNet Captions and TACoS,
demonstrating that PRVG significantly outperforms previous methods. We also
perform in-depth studies to investigate the effectiveness of the parallel
regression paradigm for video grounding.
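The abstract gives the design only at a high level; as a rough illustration, the parallel regression idea (sentence-level language features as decoder queries, moment boundaries regressed directly from language-modulated visual features) could look like the following minimal PyTorch sketch. Module names, feature dimensions, and the (center, width) parameterization are assumptions made here for clarity, not the authors' released implementation.

```python
# Minimal sketch of parallel regression decoding for dense video grounding,
# loosely following the PRVG abstract: sentence features act as decoder
# queries, and moment boundaries are regressed directly from the
# language-modulated visual features. Dimensions and the (center, width)
# parameterization are illustrative assumptions.
import torch
import torch.nn as nn


class ParallelRegressionDecoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Regression head predicts a normalized (center, width) per sentence.
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 2)
        )

    def forward(self, sent_feats, video_feats):
        # sent_feats:  (B, N_sentences, d_model) -- one query per sentence
        # video_feats: (B, T_clips, d_model)     -- encoded video clips
        hs = self.decoder(tgt=sent_feats, memory=video_feats)
        center, width = torch.sigmoid(self.head(hs)).unbind(-1)
        start = (center - 0.5 * width).clamp(0.0, 1.0)
        end = (center + 0.5 * width).clamp(0.0, 1.0)
        # All moments are predicted in parallel, one per language query,
        # so no proposal matching or non-maximum suppression is needed.
        return torch.stack([start, end], dim=-1)  # (B, N_sentences, 2)


if __name__ == "__main__":
    model = ParallelRegressionDecoder()
    sents = torch.randn(2, 4, 256)    # a paragraph of 4 sentences
    video = torch.randn(2, 128, 256)  # 128 clip features
    print(model(sents, video).shape)  # torch.Size([2, 4, 2])
```

In such a setup, each query's predicted (start, end) pair would be trained against its own sentence's ground-truth moment, for example with an L1 or IoU-style regression loss; the proposal-level attention loss mentioned in the abstract is not reproduced in this sketch.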
Related papers
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding [32.45280955448672]
Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query.
We propose an energy-based model framework to explicitly learn moment-query distributions.
We also propose DemaFormer, a novel Transformer-based architecture that utilizes an exponential moving average with a learnable damping factor.
arXiv Detail & Related papers (2023-12-05T07:37:21Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
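The CLIP-score-guided sampling mentioned in the VaQuitA summary above can be illustrated with a small, hedged sketch: rather than taking evenly spaced frames, frames are ranked by a precomputed CLIP similarity to the query text and the top-scoring ones are kept. The scoring source and the top-k selection are assumptions for illustration, not VaQuitA's actual pipeline.

```python
# Toy sketch of CLIP-score-guided frame sampling versus uniform sampling.
# It assumes per-frame CLIP similarity scores against the query text have
# already been computed elsewhere.
import torch


def uniform_sample(num_frames: int, k: int) -> torch.Tensor:
    """Pick k frame indices evenly spaced over the video."""
    return torch.linspace(0, num_frames - 1, k).round().long()


def clip_guided_sample(clip_scores: torch.Tensor, k: int) -> torch.Tensor:
    """Pick the k frames with the highest CLIP score against the query,
    returned in temporal order."""
    topk = torch.topk(clip_scores, k).indices
    return topk.sort().values


scores = torch.rand(300)              # e.g. one CLIP score per frame
print(uniform_sample(300, 8))         # evenly spaced indices
print(clip_guided_sample(scores, 8))  # query-relevant indices
```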
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
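The discounted-recall idea behind "dR@n,IoU@m" can be sketched as follows, under the assumption that an IoU-thresholded hit is scaled by how far the predicted boundaries drift from the annotated ones, normalized by video length. The exact discount used in the cited paper may differ; this is an illustration of the idea only.

```python
# Hedged sketch of a boundary-discounted recall in the spirit of
# "dR@n,IoU@m": a prediction that clears the IoU threshold contributes
# less than 1.0 when its boundaries are offset from the annotation.
def discounted_hit(pred, gt, video_len, iou_thresh=0.5):
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    iou = inter / union if union > 0 else 0.0
    if iou < iou_thresh:
        return 0.0
    # Discount factors shrink toward 0 as boundary error grows.
    alpha_s = 1.0 - abs(ps - gs) / video_len
    alpha_e = 1.0 - abs(pe - ge) / video_len
    return alpha_s * alpha_e  # plain R@1,IoU@m would return 1.0 here


# A prediction that passes the IoU threshold but sits slightly off the
# annotated boundaries now contributes about 0.95 instead of 1.0.
print(discounted_hit((10.0, 30.0), (12.0, 31.0), video_len=60.0))
```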
- A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention [31.218804432716702]
The task of language-guided video temporal grounding is to localize the particular video clip corresponding to a query sentence in an untrimmed video.
We propose a simple two-branch Cross-Modality Attention (CMA) module with an intuitive structural design.
In addition, we introduce a new task-specific regression loss function, which improves the temporal grounding accuracy by alleviating the impact of annotation bias.
arXiv Detail & Related papers (2020-09-23T16:03:00Z)
- Dense Regression Network for Video Grounding [97.57178850020327]
We use the distances from each frame within the ground truth to the starting (ending) frame as dense supervision to improve video grounding accuracy.
Specifically, we design a novel dense regression network (DRN) to regress the distances from each frame to the starting (ending) frame of the video segment.
We also propose a simple but effective IoU regression head module to explicitly consider the localization quality of the grounding results.
arXiv Detail & Related papers (2020-04-07T17:15:37Z)
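The dense supervision described in the DRN summary above can be sketched roughly as follows: every frame inside the ground-truth moment receives its distances to the moment's start and end as a regression target, so many frames supervise the localizer rather than a single proposal. The shapes and mask convention below are assumptions for illustration, not DRN's released code.

```python
# Rough sketch of dense regression targets: frames inside the ground-truth
# segment regress their distances to the segment's start and end frames.
import torch


def dense_regression_targets(num_frames: int, start: int, end: int):
    """Return per-frame (dist_to_start, dist_to_end) targets and a mask that
    is 1 only for frames inside the ground-truth segment [start, end]."""
    t = torch.arange(num_frames, dtype=torch.float32)
    targets = torch.stack([t - start, end - t], dim=-1)  # (T, 2)
    mask = ((t >= start) & (t <= end)).float()           # (T,)
    return targets, mask


targets, mask = dense_regression_targets(num_frames=16, start=4, end=9)
# Only the 6 frames inside [4, 9] would contribute to the regression loss,
# e.g. loss = (smooth_l1(pred, targets).sum(-1) * mask).sum() / mask.sum()
print(int(mask.sum()))  # 6
```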