Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training
Framework for Temporal Grounding
- URL: http://arxiv.org/abs/2207.14698v1
- Date: Fri, 29 Jul 2022 14:11:48 GMT
- Title: Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training
Framework for Temporal Grounding
- Authors: Jiachang Hao, Haifeng Sun, Pengfei Ren, Jingyu Wang, Qi Qi and Jianxin
Liao
- Abstract summary: Temporal grounding aims to locate a target video moment that semantically corresponds to the given sentence query in an untrimmed video.
Previous methods do not reason about target moment locations based on visual-textual semantic alignment, but instead over-rely on the temporal biases of queries in training sets.
This paper proposes a novel training framework for grounding models that uses shuffled videos to address the temporal bias problem without losing grounding accuracy.
- Score: 20.185272219985787
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Temporal grounding aims to locate a target video moment that semantically
corresponds to the given sentence query in an untrimmed video. However, recent
works find that existing methods suffer from a severe temporal bias problem: they
do not reason about target moment locations based on visual-textual semantic
alignment, but instead over-rely on the temporal biases of queries in the
training sets. To this end, this paper proposes a novel training framework for
grounding models that uses shuffled videos to address the temporal bias problem
without losing grounding accuracy. Our framework introduces two auxiliary tasks, cross-modal
matching and temporal order discrimination, to promote the grounding model
training. The cross-modal matching task leverages the content consistency
between shuffled and original videos to force the grounding model to mine
visual contents to semantically match queries. The temporal order
discrimination task leverages the difference in temporal order to strengthen
the understanding of long-term temporal contexts. Extensive experiments on
Charades-STA and ActivityNet Captions demonstrate the effectiveness of our
method for mitigating the reliance on temporal biases and strengthening the
model's generalization ability against the different temporal distributions.
Code is available at https://github.com/haojc/ShufflingVideosForTSG.
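As a rough illustration of the shuffling idea described in the abstract (this is not the authors' implementation; the helper names and the binary order target are hypothetical simplifications), the two ingredients the auxiliary tasks rely on can be sketched as: a permutation of a video's clip sequence that preserves its content but breaks its temporal order, and a label telling a temporal-order discriminator whether the order was preserved.

```python
import random

def shuffle_clips(clips, seed=None):
    """Shuffle the temporal order of a video's clips.

    Returns the shuffled clip sequence and the permutation used,
    so the original order can be recovered for the order task.
    """
    rng = random.Random(seed)
    order = list(range(len(clips)))
    rng.shuffle(order)
    return [clips[i] for i in order], order

def order_label(order):
    """Binary target for a temporal-order discriminator:
    1 if the permutation preserves the original order, else 0."""
    return 1 if order == sorted(order) else 0

# The cross-modal matching task exploits that shuffling keeps the
# clip *content* identical, only the order changes:
clips = ["clip0", "clip1", "clip2", "clip3"]
shuffled, perm = shuffle_clips(clips, seed=0)
assert sorted(shuffled) == sorted(clips)  # same content, possibly new order
```

In the paper's framework these signals would supervise auxiliary heads on top of the grounding model; the sketch above only captures the data-side intuition that content is invariant under shuffling while temporal order is not.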
Related papers
- Disentangle and denoise: Tackling context misalignment for video moment retrieval [16.939535169282262]
Video Moment Retrieval aims to locate in-context video moments according to a natural language query.
This paper proposes a cross-modal Context Denoising Network (CDNet) for accurate moment retrieval.
arXiv Detail & Related papers (2024-08-14T15:00:27Z)
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding [148.46348699343991]
Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework.
We propose a novel candidate-free framework: Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
arXiv Detail & Related papers (2022-10-21T13:10:27Z)
- Contrastive Language-Action Pre-training for Temporal Localization [64.34349213254312]
Long-form video understanding requires approaches that are able to temporally localize activities or language.
These limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations.
We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips and language in the form of captions.
arXiv Detail & Related papers (2022-04-26T13:17:50Z)
- Learning Sample Importance for Cross-Scenario Video Temporal Grounding [30.82619216537177]
The paper investigates some superficial biases specific to the temporal grounding task.
We propose a novel method called Debiased Temporal Language Localizer (DebiasTLL) to prevent the model from naively memorizing the biases.
We evaluate the proposed model in cross-scenario temporal grounding, where the train / test data are heterogeneously sourced.
arXiv Detail & Related papers (2022-01-08T15:41:38Z)
- End-to-End Dense Video Grounding via Parallel Regression [30.984657885692553]
Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query.
We present an end-to-end parallel decoding paradigm by re-purposing a Transformer-like architecture (PRVG).
Thanks to its simplicity in design, our PRVG framework can be applied in different testing schemes.
arXiv Detail & Related papers (2021-09-23T10:03:32Z)
- Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation [79.50868197788773]
We develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining.
We explore two cross-sentence relational constraints: (1) trimmed ordering and (2) semantic consistency among sentences in a paragraph description of video activities.
Experiments on two publicly available activity localisation datasets show the advantages of our approach over the state-of-the-art weakly supervised methods.
arXiv Detail & Related papers (2021-07-23T20:04:01Z)
- A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention [31.218804432716702]
The task of language-guided video temporal grounding is to localize the particular video clip corresponding to a query sentence in an untrimmed video.
We propose a simple two-branch Cross-Modality Attention (CMA) module with intuitive structure design.
In addition, we introduce a new task-specific regression loss function, which improves the temporal grounding accuracy by alleviating the impact of annotation bias.
arXiv Detail & Related papers (2020-09-23T16:03:00Z) - Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of
Sentence in Video [53.69956349097428]
Given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence.
We propose a two-stage model to tackle this problem in a coarse-to-fine manner.
arXiv Detail & Related papers (2020-01-25T13:07:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.