SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding
- URL: http://arxiv.org/abs/2407.05118v2
- Date: Mon, 15 Jul 2024 16:53:17 GMT
- Title: SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding
- Authors: Zixu Cheng, Yujiang Pu, Shaogang Gong, Parisa Kordjamshidi, Yu Kong
- Abstract summary: Temporal grounding, also known as video moment retrieval, aims at locating video segments corresponding to a given query sentence.
We propose a large language model-driven method for negative query construction, utilizing GPT-3.5-Turbo.
We introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries.
- Score: 52.98133831401225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal grounding, also known as video moment retrieval, aims at locating video segments corresponding to a given query sentence. The compositional nature of natural language enables the localization beyond predefined events, posing a certain challenge to the compositional generalizability of existing methods. Recent studies establish the correspondence between videos and queries through a decompose-reconstruct manner to achieve compositional generalization. However, they only consider dominant primitives and build negative queries through random sampling and recombination, resulting in semantically implausible negatives that hinder the models from learning rational compositions. In addition, recent DETR-based methods still underperform in compositional temporal grounding, showing irrational saliency responses when given negative queries that have subtle differences from positive queries. To address these limitations, we first propose a large language model-driven method for negative query construction, utilizing GPT-3.5-Turbo to generate semantically plausible hard negative queries. Subsequently, we introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries to boost compositional generalization. Extensive experiments on two challenging benchmarks validate the effectiveness and generalizability of our proposed method. Our code is available at https://github.com/zxccade/SHINE.
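The coarse-to-fine ranking idea in the abstract can be illustrated with a small margin-based ranking loss. This is only a sketch under stated assumptions, not the loss used in SHINE: the function names, the margin value, and the choice to rank one scalar saliency score per query (for example, mean saliency inside the ground-truth moment) are all hypothetical. It assumes the queries are ordered from the positive query, through negatives that differ subtly from it, to negatives that differ coarsely, and penalizes any adjacent pair whose saliency scores are not separated by the margin.

```python
from typing import Sequence


def hinge(margin: float, higher: float, lower: float) -> float:
    """Return zero once `higher` exceeds `lower` by at least `margin`."""
    return max(0.0, margin - (higher - lower))


def hierarchical_ranking_loss(scores: Sequence[float], margin: float = 0.2) -> float:
    """Sum hinge penalties over adjacent pairs of saliency scores.

    `scores` is assumed to be ordered from the most relevant query to the
    least relevant one: [positive, subtle negative, ..., coarse negative].
    This ordering and the pairwise hinge form are illustrative assumptions.
    """
    if len(scores) < 2:
        raise ValueError("need at least a positive and one negative score")
    return sum(hinge(margin, hi, lo) for hi, lo in zip(scores, scores[1:]))


if __name__ == "__main__":
    # Well-separated scores incur no penalty.
    print(hierarchical_ranking_loss([0.9, 0.5, 0.1]))
    # A subtle negative scoring above the positive is penalized.
    print(hierarchical_ranking_loss([0.5, 0.6, 0.1]))
```

Ranking adjacent levels, rather than only positive against negative, is one simple way to express that a query with a subtle change should still score higher than a query with a larger change.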
Related papers
- Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization [67.88493779080882]
Video moment localization aims to retrieve the target segment of an untrimmed video according to the natural language query.
Recent works contrast the cross-modality similarities driven by reconstructing masked queries.
We propose a novel counterfactual cross-modality reasoning method.
arXiv Detail & Related papers (2023-08-10T15:45:45Z)
- Regularized Contrastive Learning of Semantic Search [0.0]
Transformer-based models are widely used as retrieval models due to their excellent ability to learn semantic representations.
We propose a new regularization method: Regularized Contrastive Learning.
It augments several different semantic representations for every sentence, then takes them into the contrastive objective as regulators.
arXiv Detail & Related papers (2022-09-27T08:25:19Z)
- SeqZero: Few-shot Compositional Semantic Parsing with Sequential Prompts and Zero-shot Models [57.29358388475983]
Recent research showed promising results on combining pretrained language models with canonical utterances.
We propose a novel few-shot semantic parsing method -- SeqZero.
In particular, SeqZero brings out the merits from both models via ensemble equipped with our proposed constrained rescaling.
arXiv Detail & Related papers (2022-05-15T21:13:15Z)
- Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning [92.07643510310766]
Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We empirically find that state-of-the-art methods fail to generalize to queries with novel combinations of seen words.
We propose a variational cross-graph reasoning framework that explicitly decomposes video and language into multiple structured hierarchies.
arXiv Detail & Related papers (2022-03-24T12:55:23Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Grounded Graph Decoding Improves Compositional Generalization in Question Answering [68.72605660152101]
Question answering models struggle to generalize to novel compositions of training patterns, such as longer sequences or more complex test structures.
We propose Grounded Graph Decoding, a method to improve compositional generalization of language representations by grounding structured predictions with an attention mechanism.
Our model significantly outperforms state-of-the-art baselines on the Compositional Freebase Questions (CFQ) dataset, a challenging benchmark for compositional generalization in question answering.
arXiv Detail & Related papers (2021-11-05T17:50:14Z)
- End-to-End Dense Video Grounding via Parallel Regression [30.984657885692553]
Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query.
We present an end-to-end parallel decoding paradigm (PRVG) by re-purposing a Transformer-alike architecture.
Thanks to its simplicity in design, our PRVG framework can be applied in different testing schemes.
arXiv Detail & Related papers (2021-09-23T10:03:32Z)
- Compositional Generalization and Natural Language Variation: Can a Semantic Parsing Approach Handle Both? [27.590858384414567]
We ask: can we develop a semantic parsing approach that handles both natural language variation and compositional generalization?
We propose new train and test splits of non-synthetic datasets to better assess this capability.
We also propose NQG-T5, a hybrid model that combines a high-precision grammar-based approach with a pre-trained sequence-to-sequence model.
arXiv Detail & Related papers (2020-10-24T00:38:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.