A Closer Look at Debiased Temporal Sentence Grounding in Videos:
Dataset, Metric, and Approach
- URL: http://arxiv.org/abs/2203.05243v1
- Date: Thu, 10 Mar 2022 08:58:18 GMT
- Title: A Closer Look at Debiased Temporal Sentence Grounding in Videos:
Dataset, Metric, and Approach
- Authors: Xiaohan Lan, Yitian Yuan, Xin Wang, Long Chen, Zhi Wang, Lin Ma and
Wenwu Zhu
- Abstract summary: Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
- Score: 53.727460222955266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Sentence Grounding in Videos (TSGV), which aims to ground a natural
language sentence in an untrimmed video, has drawn widespread attention over
the past few years. However, recent studies have found that current benchmark
datasets may have obvious moment annotation biases, enabling several simple
baselines even without training to achieve SOTA performance. In this paper, we
take a closer look at existing evaluation protocols, and find both the
prevailing dataset and evaluation metrics are the devils that lead to
untrustworthy benchmarking. Therefore, we propose to re-organize the two
widely-used datasets, making the ground-truth moment distributions different in
the training and test splits, i.e., out-of-distribution (OOD) test. Meanwhile,
we introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic
recall scores to alleviate the inflating evaluation caused by biased datasets.
New benchmarking results indicate that our proposed evaluation protocols can
better monitor the research progress. Furthermore, we propose a novel
causality-based Multi-branch Deconfounding Debiasing (MDD) framework for
unbiased moment prediction. Specifically, we design a multi-branch deconfounder
to eliminate the effects caused by multiple confounders with causal
intervention. In order to help the model better align the semantics between
sentence queries and video moments, we enhance the representations during
feature encoding. Specifically, for textual information, the query is parsed
into several verb-centered phrases to obtain a more fine-grained textual
feature. For visual information, the positional information has been decomposed
from moment features to enhance representations of moments with diverse
locations. Extensive experiments demonstrate that our proposed approach can
achieve competitive results among existing SOTA approaches and outperform the
base model with great gains.
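The abstract names the discounted metric "dR@n,IoU@m" but not its exact form. Below is a minimal, self-contained Python sketch of one plausible reading: a standard R@n,IoU@m hit is scaled down by how far the predicted start and end drift from the ground-truth boundaries, so a biased model that lands in roughly the right region but with sloppy boundaries is rewarded less. The alpha terms and the normalization by video duration are illustrative assumptions, not the authors' definition; consult the paper for the exact discount.

```python
# Illustrative sketch of a discounted recall "dR@n,IoU@m" in the spirit of the
# abstract: a basic R@n,IoU@m hit is down-weighted ("discounted") when the
# predicted boundaries drift away from the ground truth. The alpha terms below
# are an assumed form for illustration, not the authors' definition.

def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) moments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def discounted_recall(predictions, ground_truths, durations, n=1, m=0.5):
    """predictions: one ranked list of (start, end) per query;
    ground_truths: one (start, end) per query; durations: video lengths (s)."""
    total = 0.0
    for preds, gt, dur in zip(predictions, ground_truths, durations):
        best = 0.0
        for p in preds[:n]:
            if temporal_iou(p, gt) >= m:
                # Assumed discount: 1 minus the start/end offset normalized
                # by video duration, so larger boundary drift scores lower.
                alpha_s = max(0.0, 1.0 - abs(p[0] - gt[0]) / dur)
                alpha_e = max(0.0, 1.0 - abs(p[1] - gt[1]) / dur)
                best = max(best, alpha_s * alpha_e)
        total += best
    return 100.0 * total / len(ground_truths)

# Example: the top-1 prediction has IoU >= 0.5 but is shifted by one second
# at each boundary, so the discounted score is slightly below 100.
print(discounted_recall([[(4.0, 11.0)]], [(5.0, 12.0)], [30.0], n=1, m=0.5))
```

Likewise, the abstract does not spell out what "causal intervention" means for the multi-branch deconfounder. In this line of debiasing work (see the Interventional Video Grounding paper under related papers) it typically refers to backdoor adjustment; as a hedged reading of the abstract, each branch would apply such an adjustment for one confounder z:

```latex
% Backdoor adjustment: prediction under the intervention do(X) averages the
% confounder z over its prior P(z) instead of the input-conditional P(z | X),
% cutting the spurious path from the confounder to the prediction.
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, z)\, P(z)
```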
Related papers
- Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal
Intervention [72.12974259966592]
We present a unique and systematic study of a temporal bias due to frame length discrepancy between training and test sets of trimmed video clips.
We propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets.
arXiv Detail & Related papers (2023-09-17T15:58:27Z)
- MomentDiff: Generative Video Moment Retrieval from Random to Real [71.40038773943638]
We provide a generative diffusion-based framework called MomentDiff.
MomentDiff simulates a typical human retrieval process from random browsing to gradual localization.
We show that MomentDiff consistently outperforms state-of-the-art methods on three public benchmarks.
arXiv Detail & Related papers (2023-07-06T09:12:13Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Learning Sample Importance for Cross-Scenario Video Temporal Grounding [30.82619216537177]
The paper investigates some superficial biases specific to the temporal grounding task.
We propose a novel method called Debiased Temporal Language Localizer (DebiasTLL) to prevent the model from naively memorizing the biases.
We evaluate the proposed model in cross-scenario temporal grounding, where the train / test data are heterogeneously sourced.
arXiv Detail & Related papers (2022-01-08T15:41:38Z)
- Interventional Video Grounding with Dual Contrastive Learning [16.0734337895897]
Video grounding aims to localize a moment from an untrimmed video for a given textual query.
We propose a novel paradigm from the perspective of causal inference to uncover the causality behind the model and data.
We also introduce a dual contrastive learning approach to better align the text and video.
arXiv Detail & Related papers (2021-06-21T12:11:28Z)
- A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics [70.45937234489044]
We re-organize two widely-used TSGV datasets (Charades-STA and ActivityNet Captions) so that the moment distribution of the test split differs from that of the training split.
We introduce a new evaluation metric "dR@n,IoU@m" to calibrate the basic IoU scores.
All the results demonstrate that the re-organized datasets and new metric can better monitor the progress in TSGV.
arXiv Detail & Related papers (2021-01-22T09:59:30Z)
- Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark [54.782397511033345]
Crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases like annotation artifacts.
We present a new unified cross-dataset benchmark with 14 NLI datasets and re-evaluate 9 widely-used neural network-based NLI models.
Our proposed evaluation scheme and experimental baselines could provide a basis to inspire future reliable NLI research.
arXiv Detail & Related papers (2020-10-15T11:50:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.