Meta Spatio-Temporal Debiasing for Video Scene Graph Generation
- URL: http://arxiv.org/abs/2207.11441v1
- Date: Sat, 23 Jul 2022 07:06:06 GMT
- Title: Meta Spatio-Temporal Debiasing for Video Scene Graph Generation
- Authors: Li Xu, Haoxuan Qu, Jason Kuen, Jiuxiang Gu and Jun Liu
- Abstract summary: We propose a novel Meta Video Scene Graph Generation (MVSGG) framework to address the spatio-temporal conditional bias problem.
Our framework first constructs a support set and a group of query sets from the training data.
Then, by performing a meta training and testing process to optimize the model, our framework effectively guides the model to generalize well against biases.
- Score: 22.216881800098726
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video scene graph generation (VidSGG) aims to parse the video content into
scene graphs, which involves modeling the spatio-temporal contextual
information in the video. However, due to the long-tailed training data in
datasets, the generalization performance of existing VidSGG models can be
affected by the spatio-temporal conditional bias problem. In this work, from
the perspective of meta-learning, we propose a novel Meta Video Scene Graph
Generation (MVSGG) framework to address such a bias problem. Specifically, to
handle various types of spatio-temporal conditional biases, our framework first
constructs a support set and a group of query sets from the training data,
where the data distribution of each query set is different from that of the
support set w.r.t. a type of conditional bias. Then, by performing a novel meta
training and testing process to optimize the model to obtain good testing
performance on these query sets after training on the support set, our
framework can effectively guide the model to learn to generalize well against
biases. Extensive experiments demonstrate the efficacy of our proposed
framework.
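To make the meta training and testing process concrete, below is a minimal, first-order sketch of a single optimization step. It assumes a generic VidSGG predicate classifier `model` that maps relation features to predicate logits, and that a support batch and bias-conflicting query batches have already been constructed; the names `meta_debias_step`, `support_batch`, `query_batches`, and `inner_lr` are illustrative and not from the paper, whose query-set construction and optimization scheme are more elaborate.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def meta_debias_step(model: nn.Module, outer_opt: torch.optim.Optimizer,
                     support_batch, query_batches, inner_lr: float = 1e-3):
    """One meta step: adapt on the support set, evaluate on bias-conflicting
    query sets, and update the original model from the query (meta-test) loss."""
    feats_s, labels_s = support_batch

    # Meta-train: take an inner gradient step on a copy of the model,
    # using the support set only.
    adapted = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    inner_opt.zero_grad()
    F.cross_entropy(adapted(feats_s), labels_s).backward()
    inner_opt.step()

    # Meta-test: the adapted parameters should also perform well on query sets
    # whose conditional distributions conflict with the support set.
    adapted.zero_grad()
    meta_loss = sum(F.cross_entropy(adapted(f_q), l_q)
                    for f_q, l_q in query_batches) / len(query_batches)
    meta_loss.backward()

    # First-order outer update: transfer the query-set gradients from the
    # adapted copy back to the original model's parameters.
    for p, p_adapted in zip(model.parameters(), adapted.parameters()):
        p.grad = None if p_adapted.grad is None else p_adapted.grad.clone()
    outer_opt.step()
    return float(meta_loss)
```

The intent of the outer update is that parameters which merely exploit spatio-temporal co-occurrence statistics in the support set incur a high meta-test loss on the bias-conflicting query sets, steering the model toward bias-robust predictions.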
Related papers
- Training-free Video Temporal Grounding using Large-scale Pre-trained Models [41.71055776623368]
Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query.
Existing video temporal localization models rely on dataset-specific training and incur high data collection costs.
We propose a Training-Free Video Temporal Grounding approach that leverages the capabilities of large-scale pre-trained models without task-specific training.
arXiv Detail & Related papers (2024-08-29T02:25:12Z)
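As a rough illustration of the training-free idea in the entry above, the following sketch grounds a query by selecting the contiguous window of frames whose pre-extracted embeddings (e.g., from a CLIP-style model) best match the query embedding. The function `ground_query` and its window-search heuristic are assumptions made for illustration, not the paper's actual pipeline.

```python
import numpy as np

def ground_query(frame_emb: np.ndarray, query_emb: np.ndarray,
                 min_len: int = 4, max_len: int = 64):
    """Return (start, end) frame indices of the window most similar to the query."""
    # Cosine similarity between each frame embedding and the query embedding.
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sim = f @ q                                   # shape: (T,)

    best, best_score = (0, min_len), -np.inf
    for length in range(min_len, min(max_len, len(sim)) + 1):
        # Mean similarity over every contiguous window of this length.
        window_means = np.convolve(sim, np.ones(length) / length, mode="valid")
        start = int(window_means.argmax())
        if window_means[start] > best_score:
            best_score = float(window_means[start])
            best = (start, start + length)
    return best, best_score
```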
- Bilevel Fast Scene Adaptation for Low-Light Image Enhancement [50.639332885989255]
Enhancing images in low-light scenes is a challenging yet widely studied task in computer vision.
The main obstacle lies in modeling the distribution discrepancy across different scenes.
We introduce a bilevel paradigm to model this latent correspondence.
A bilevel learning framework is constructed to endow the encoder with scene-irrelevant generality across diverse scenes.
arXiv Detail & Related papers (2023-06-02T08:16:21Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently under video transformations.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
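A minimal sketch of a transform-equivariant consistency loss in the spirit of the entry above: boundaries predicted on a temporally cropped view, mapped back to the original timeline, should agree with boundaries predicted on the full video. It assumes `model(video, query)` returns normalized (start, end) boundaries as tensors in [0, 1]; the paper's augmentations and loss design are richer than this.

```python
import torch
import torch.nn.functional as F

def equivariant_consistency_loss(model, video, query,
                                 crop_start: float = 0.1, crop_end: float = 0.9):
    # Boundaries predicted on the original video.
    s_orig, e_orig = model(video, query)

    # Apply a temporal crop with known parameters (video: [T, C, H, W]).
    T = video.shape[0]
    lo, hi = int(crop_start * T), int(crop_end * T)
    s_aug, e_aug = model(video[lo:hi], query)

    # Map the boundaries predicted on the crop back to the original timeline.
    scale = crop_end - crop_start
    s_back = crop_start + s_aug * scale
    e_back = crop_start + e_aug * scale

    # Equivariance: both views should agree on the same moment.
    return F.smooth_l1_loss(torch.stack([s_back, e_back]),
                            torch.stack([s_orig, e_orig]))
```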
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Adaptive graph convolutional networks for weakly supervised anomaly detection in videos [42.3118758940767]
We propose a weakly supervised adaptive graph convolutional network (WAGCN) to model the contextual relationships among video segments.
We fully consider the influence of other video segments on the current segment when generating the anomaly probability score for each segment.
arXiv Detail & Related papers (2022-02-14T06:31:34Z)
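A minimal sketch of scoring video segments with a graph built adaptively from feature similarity, loosely following the entry above. The class name, adjacency construction, and single graph-convolution layer are illustrative simplifications; the paper additionally handles the weak, video-level supervision.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentGCNScorer(nn.Module):
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.gcn = nn.Linear(dim, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, seg_feats: torch.Tensor) -> torch.Tensor:
        # Adjacency from pairwise cosine similarity between segments.
        f = F.normalize(seg_feats, dim=1)
        adj = F.softmax(f @ f.t(), dim=1)              # row-normalized weights

        # One graph-convolution step: aggregate neighbor features, then project.
        agg = adj @ seg_feats
        h = F.relu(self.gcn(agg))

        # Per-segment anomaly probability.
        return torch.sigmoid(self.head(h)).squeeze(-1)  # shape: [N]

# Usage with random features for 32 segments of dimension 512:
scores = SegmentGCNScorer(512)(torch.randn(32, 512))
```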
- Towards Debiasing Temporal Sentence Grounding in Video [59.42702544312366]
The temporal sentence grounding in video (TSGV) task is to locate a temporal moment in an untrimmed video that matches a language query.
Without accounting for the bias in moment annotations, many models simply capture the statistical regularities of those annotations rather than genuine cross-modal interactions.
We propose two debiasing strategies, data debiasing and model debiasing, to "force" a TSGV model to capture cross-modal interactions.
arXiv Detail & Related papers (2021-11-08T08:18:25Z)
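As one generic way to realize the data-debiasing idea mentioned above, the sketch below reweights training samples inversely to the frequency of their normalized moment location, so a model cannot score well by exploiting annotation regularities alone. This is an assumed illustration, not the paper's exact strategy.

```python
import numpy as np

def location_debias_weights(starts, ends, durations, n_bins: int = 10):
    """starts/ends: annotated moment boundaries; durations: video lengths."""
    starts, ends, durations = map(np.asarray, (starts, ends, durations))
    # Bin each moment by its normalized center position in the video.
    centers = ((starts + ends) / 2.0) / durations
    bins = np.clip((centers * n_bins).astype(int), 0, n_bins - 1)

    # Inverse-frequency weights, normalized to mean 1.
    counts = np.bincount(bins, minlength=n_bins).astype(float)
    weights = 1.0 / np.maximum(counts[bins], 1.0)
    return weights / weights.mean()
```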
- Greedy Gradient Ensemble for Robust Visual Question Answering [163.65789778416172]
We examine the language bias in Visual Question Answering (VQA), which arises from two aspects: distribution bias and shortcut bias.
We propose a new debiasing framework, Greedy Gradient Ensemble (GGE), which combines multiple biased models for unbiased base model learning.
GGE forces the biased models to overfit the biased data distribution first, which makes the base model pay more attention to examples that are hard for the biased models to solve.
arXiv Detail & Related papers (2021-07-27T08:02:49Z)
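A minimal sketch of training a base model together with a biased (e.g., question-only) branch so that the base model concentrates on examples the biased branch cannot explain. This is a generic product-of-experts style illustration of ensemble debiasing, not the paper's exact greedy gradient ensemble procedure.

```python
import torch
import torch.nn.functional as F

def debiased_loss(base_logits, biased_logits, labels):
    # The biased branch is trained on its own objective.
    biased_loss = F.cross_entropy(biased_logits, labels)

    # The base model is trained through the combined (log-space) ensemble, so
    # its gradients are driven mainly by examples the biased branch gets wrong.
    ensemble_logits = base_logits + biased_logits.detach()
    base_loss = F.cross_entropy(ensemble_logits, labels)
    return base_loss + biased_loss
```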
- CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)
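A minimal sketch of language-guided frame scoring in the spirit of the entry above: frame features cross-attend to query token features and a linear head outputs per-frame importance. The class name and dimensions are assumptions; CLIP-It's multimodal transformer is considerably larger than this.

```python
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, frame_feats: torch.Tensor, query_tokens: torch.Tensor):
        # frame_feats: [B, T, D]; query_tokens: [B, L, D] (text token features).
        attended, _ = self.cross_attn(frame_feats, query_tokens, query_tokens)
        return self.head(attended).squeeze(-1)          # importance scores [B, T]

# Usage with random features: 1 video, 120 frames, 16 query tokens.
scores = FrameScorer()(torch.randn(1, 120, 512), torch.randn(1, 16, 512))
```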
- Interventional Video Grounding with Dual Contrastive Learning [16.0734337895897]
Video grounding aims to localize a moment from an untrimmed video for a given textual query.
We propose a novel paradigm from the perspective of causal inference to uncover the causality behind the model and data.
We also introduce a dual contrastive learning approach to better align the text and video.
arXiv Detail & Related papers (2021-06-21T12:11:28Z)
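A minimal sketch of a symmetric (dual-direction) InfoNCE loss that aligns video and query embeddings, illustrating the general video-text contrastive idea in the entry above; the paper's dual contrastive design is more specific than this.

```python
import torch
import torch.nn.functional as F

def dual_contrastive_loss(video_emb, text_emb, temperature: float = 0.07):
    # video_emb, text_emb: [B, D]; matched pairs share the same row index.
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    logits = (v @ t.t()) / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Video-to-text and text-to-video directions, averaged.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```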