Related papers: Weakly-Supervised Video Object Grounding via Causal Intervention

URL: http://arxiv.org/abs/2112.00475v1
Date: Wed, 1 Dec 2021 13:13:03 GMT
Title: Weakly-Supervised Video Object Grounding via Causal Intervention
Authors: Wei Wang, Junyu Gao, Changsheng Xu
Abstract summary: We target the task of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available during model learning. It aims to localize objects described in the sentence to visual regions in the video, which is a fundamental capability needed in pattern analysis and machine learning.
Score: 82.68192973503119
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We target the task of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available during model learning. It aims to localize objects described in the sentence to visual regions in the video, which is a fundamental capability needed in pattern analysis and machine learning. Despite recent progress, existing methods all suffer from the severe problem of spurious association, which harms grounding performance. In this paper, we start from the definition of WSVOG and pinpoint the spurious association from two aspects: (1) the association itself is not object-relevant but extremely ambiguous due to weak supervision, and (2) the association is unavoidably confounded by observational bias when the statistics-based matching strategy of existing methods is used. With this in mind, we design a unified causal framework to learn the deconfounded object-relevant association for more accurate and robust video object grounding. Specifically, we learn the object-relevant association by causal intervention from the perspective of the video data generation process. To overcome the lack of fine-grained supervision for intervention, we propose a novel spatial-temporal adversarial contrastive learning paradigm. To further remove the accompanying confounding effect within the object-relevant association, we pursue the true causality by conducting causal intervention via backdoor adjustment. Finally, the deconfounded object-relevant association is learned and optimized under a unified causal framework in an end-to-end manner. Extensive experiments on both IID and OOD testing sets of three benchmarks demonstrate its accurate and robust grounding performance against state-of-the-art methods.
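The abstract names backdoor adjustment without stating it. As a rough illustration only, the generic textbook form is P(Y | do(X=x)) = sum over z of P(Y | X=x, Z=z) P(Z=z), where Z is an observed confounder. The sketch below evaluates that sum on a made-up binary example and compares it with the plain conditional P(Y | X=x). The function names `backdoor_adjust` and `observational`, the variables, and all probabilities are assumptions for this sketch; none of it is taken from the paper's model or code.

```python
# Toy illustration of backdoor adjustment in its generic textbook form.
# Every number and name here is hypothetical; this is not the paper's method.

def backdoor_adjust(p_y_given_xz, p_z, x):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())


def observational(p_y_given_xz, p_z, p_x1_given_z, x):
    """P(Y=1 | X=x), which weights each z by P(Z=z | X=x) rather than P(Z=z)."""
    def p_x_given_z(z):
        # Probability of the requested x value given z (X is binary).
        return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]

    p_x = sum(p_x_given_z(z) * pz for z, pz in p_z.items())
    return sum(
        p_y_given_xz[(x, z)] * p_x_given_z(z) * pz / p_x
        for z, pz in p_z.items()
    )


# Hypothetical binary confounder Z (for instance, scene context).
p_z = {0: 0.5, 1: 0.5}
# P(X=1 | Z=z): the confounder influences the "treatment" X.
p_x1_given_z = {0: 0.2, 1: 0.8}
# P(Y=1 | X=x, Z=z): the confounder also influences the outcome Y.
p_y_given_xz = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.7}

interventional = backdoor_adjust(p_y_given_xz, p_z, x=1)
observed = observational(p_y_given_xz, p_z, p_x1_given_z, x=1)
print(f"P(Y=1 | do(X=1)) = {interventional:.2f}")
print(f"P(Y=1 | X=1)     = {observed:.2f}")
```

In this toy setup the plain conditional is larger than the adjusted quantity because Z raises both X and Y, which is the kind of observational bias the abstract says statistics-based matching picks up.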

Related papers

Object-Centric Latent Action Learning [70.3173534658611]
We propose a novel object-centric latent action learning approach, based on VideoSaur and LAPO. This method effectively disentangles causal agent-object interactions from irrelevant background noise and reduces the performance degradation caused by distractors. Our preliminary experiments with the Distracting Control Suite show that latent action pretraining based on object decompositions improves the quality of inferred latent actions by 2.7x and the efficiency of downstream fine-tuning with a small set of labeled actions, increasing return by 2.6x on average.
arXiv Detail & Related papers (2025-02-13T11:27:05Z)
Knowledge-guided Causal Intervention for Weakly-supervised Object Localization [32.99508048913356]
KG-CI-CAM is a knowledge-guided causal intervention method. We tackle the co-occurrence context confounder problem via causal intervention. We introduce a multi-source knowledge guidance framework to strike a balance between absorbing classification knowledge and localization knowledge.
arXiv Detail & Related papers (2023-01-03T12:02:19Z)
Tackling Background Distraction in Video Object Segmentation [7.187425003801958]
Video object segmentation (VOS) aims to densely track certain objects in videos. One of the main challenges in this task is the existence of background distractors that appear similar to the target objects. We propose three novel strategies to suppress such distractors. Our model achieves performance comparable to contemporary state-of-the-art approaches while running in real time.
arXiv Detail & Related papers (2022-07-14T14:25:19Z)
SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model. Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
Bi-directional Object-context Prioritization Learning for Saliency Ranking [60.62461793691836]
Existing approaches focus on learning either object-object or object-scene relations. We observe that spatial attention works concurrently with object-based attention in the human visual recognition system. We propose a novel bi-directional method to unify spatial attention and object-based attention for saliency ranking.
arXiv Detail & Related papers (2022-03-17T16:16:03Z)
Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding [93.82542533426766]
We propose a Suspected Object Transformation mechanism (SOT) to encourage the target object selection among the suspected ones. SOT can be seamlessly integrated into existing CNN and Transformer-based one-stage visual grounders. Extensive experiments demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2022-03-10T06:41:07Z)
Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame of the current time-stamp. SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection. A co-attention formulation is utilized to combine the low-level and high-level features. We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
Relation-aware Instance Refinement for Weakly Supervised Visual Grounding [44.33411132188231]
Visual grounding aims to build a correspondence between visual objects and their language entities. We propose a novel weakly-supervised learning method that incorporates coarse-to-fine object refinement and entity relation modeling. Experiments on two public benchmarks demonstrate the efficacy of our framework.
arXiv Detail & Related papers (2021-03-24T05:03:54Z)
Visual Relation Grounding in Videos [86.06874453626347]
We explore a novel task named visual Relation Grounding in Videos (RGV). This task aims at providing supportive visual facts for other video-language tasks (e.g., video grounding and video question answering). We tackle challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical-temporal region. Experimental results demonstrate that our model not only outperforms baseline approaches significantly, but also produces visually meaningful facts.
arXiv Detail & Related papers (2020-07-17T08:20:39Z)