Weakly-Supervised Video Object Grounding via Causal Intervention
- URL: http://arxiv.org/abs/2112.00475v1
- Date: Wed, 1 Dec 2021 13:13:03 GMT
- Title: Weakly-Supervised Video Object Grounding via Causal Intervention
- Authors: Wei Wang, Junyu Gao, Changsheng Xu
- Abstract summary: We target the task of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available during model learning.
It aims to localize objects described in the sentence to visual regions in the video, which is a fundamental capability needed in pattern analysis and machine learning.
- Score: 82.68192973503119
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We target the task of weakly-supervised video object grounding (WSVOG),
where only video-sentence annotations are available during model learning. It
aims to localize objects described in the sentence to visual regions in the
video, which is a fundamental capability needed in pattern analysis and machine
learning. Despite recent progress, existing methods all suffer from the severe
problem of spurious association, which harms grounding performance. In this
paper, we start from the definition of WSVOG and pinpoint
the spurious association from two aspects: (1) the association itself is not
object-relevant but extremely ambiguous due to weak supervision, and (2) the
association is unavoidably confounded by the observational bias when taking the
statistics-based matching strategy in existing methods. With this in mind, we
design a unified causal framework to learn the deconfounded object-relevant
association for more accurate and robust video object grounding. Specifically,
we learn the object-relevant association by causal intervention from the
perspective of the video data generation process. To overcome the lack of
fine-grained supervision for intervention, we propose a novel spatial-temporal
adversarial contrastive learning paradigm. To further remove
the accompanying confounding effect within the object-relevant association, we
pursue the true causality by conducting causal intervention via backdoor
adjustment. Finally, the deconfounded object-relevant association is learned
and optimized under a unified causal framework in an end-to-end manner.
Extensive experiments on both IID and OOD testing sets of three benchmarks
demonstrate its accurate and robust grounding performance compared with
state-of-the-art methods.
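For readers less familiar with the terminology, the backdoor adjustment named in the abstract is the standard causal-inference formula for removing a confounder. Written in generic notation (X a query word, R a candidate region, Z the confounder; these symbols are illustrative and not the paper's own), the confounded observational matching and its deconfounded counterpart read

  P(R | X) = \sum_{z} P(R | X, Z = z) \, P(Z = z | X)   (observational, biased toward frequent contexts)
  P(R | do(X)) = \sum_{z} P(R | X, Z = z) \, P(Z = z)   (backdoor-adjusted)

That is, the confounder is averaged over its prior rather than its query-conditioned distribution, which is what breaks the spurious association the abstract describes; how the paper instantiates Z and approximates the summation is detailed in the full text.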
Related papers
- Knowledge-guided Causal Intervention for Weakly-supervised Object
Localization [32.99508048913356]
KG-CI-CAM is a knowledge-guided causal intervention method.
We tackle the co-occurrence context confounder problem via causal intervention.
We introduce a multi-source knowledge guidance framework to strike a balance between absorbing classification knowledge and localization knowledge.
arXiv Detail & Related papers (2023-01-03T12:02:19Z) - Tackling Background Distraction in Video Object Segmentation [7.187425003801958]
Video object segmentation (VOS) aims to densely track certain objects in videos.
One of the main challenges in this task is the existence of background distractors that appear similar to the target objects.
We propose three novel strategies to suppress such distractors.
Our model achieves performance comparable to contemporary state-of-the-art approaches while running in real time.
arXiv Detail & Related papers (2022-07-14T14:25:19Z) - SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric
Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z) - Bi-directional Object-context Prioritization Learning for Saliency
Ranking [60.62461793691836]
Existing approaches focus on learning either object-object or object-scene relations.
We observe that spatial attention works concurrently with object-based attention in the human visual recognition system.
We propose a novel bi-directional method to unify spatial attention and object-based attention for saliency ranking.
arXiv Detail & Related papers (2022-03-17T16:16:03Z) - Suspected Object Matters: Rethinking Model's Prediction for One-stage
Visual Grounding [93.82542533426766]
We propose a Suspected Object Transformation mechanism (SOT) to encourage the target object selection among the suspected ones.
SOT can be seamlessly integrated into existing CNN and Transformer-based one-stage visual grounders.
Extensive experiments demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2022-03-10T06:41:07Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively, producing a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z) - Video Salient Object Detection via Contrastive Features and Attention
Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - Relation-aware Instance Refinement for Weakly Supervised Visual
Grounding [44.33411132188231]
Visual grounding aims to build a correspondence between visual objects and their language entities.
We propose a novel weakly-supervised learning method that incorporates coarse-to-fine object refinement and entity relation modeling.
Experiments on two public benchmarks demonstrate the efficacy of our framework.
arXiv Detail & Related papers (2021-03-24T05:03:54Z) - Visual Relation Grounding in Videos [86.06874453626347]
We explore a novel task named visual Relation Grounding in Videos (RGV).
This task aims at providing supportive visual facts for other video-language tasks (e.g., video grounding and video question answering).
We tackle the challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical spatio-temporal region graph.
Experimental results demonstrate that our model can not only outperform baseline approaches significantly, but also produce visually meaningful facts.
arXiv Detail & Related papers (2020-07-17T08:20:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.