Taking A Closer Look at Visual Relation: Unbiased Video Scene Graph
Generation with Decoupled Label Learning
- URL: http://arxiv.org/abs/2303.13209v1
- Date: Thu, 23 Mar 2023 12:08:10 GMT
- Title: Taking A Closer Look at Visual Relation: Unbiased Video Scene Graph
Generation with Decoupled Label Learning
- Authors: Wenqing Wang, Yawei Luo, Zhiqing Chen, Tao Jiang, Lei Chen, Yi Yang,
Jun Xiao
- Abstract summary: We take a closer look at the predicates and identify that most visual relations (e.g. sit_above) involve both an actional pattern (sit) and a spatial pattern (above).
We propose a decoupled label learning (DLL) paradigm to address the intractable visual relation prediction from the pattern-level perspective.
- Score: 43.68357108342476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current video-based scene graph generation (VidSGG) methods have been found
to perform poorly at predicting predicates that are less represented, due to the
inherently biased distribution in the training data. In this paper, we take a
closer look at the predicates and identify that most visual relations (e.g.
sit_above) involve both an actional pattern (sit) and a spatial pattern (above),
while the distribution bias is much less severe at the pattern level. Based on
this insight, we propose a decoupled label learning (DLL) paradigm to address
the intractable visual relation prediction from the pattern-level perspective.
Specifically, DLL decouples the predicate labels and adopts separate
classifiers to learn actional and spatial patterns respectively. The patterns
are then combined and mapped back to the predicate. Moreover, we propose a
knowledge-level label decoupling method to transfer non-target knowledge from
head predicates to tail predicates within the same pattern to calibrate the
distribution of tail classes. We validate the effectiveness of DLL on the
commonly used VidSGG benchmark, i.e. VidVRD. Extensive experiments demonstrate
that DLL offers a remarkably simple yet highly effective solution to the
long-tailed problem, achieving state-of-the-art VidSGG performance.
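As a rough illustration of the decoupling idea described in the abstract, the sketch below splits a predicate such as sit_above into an actional part and a spatial part, trains a separate classifier head for each, and recombines the two pattern scores into a predicate score, with a toy soft-label function standing in for the knowledge-level label decoupling. This is a minimal, hypothetical PyTorch-style sketch, not the authors' implementation: the class and function names, the toy label maps, and the smoothing weight alpha are all assumptions introduced here for illustration.

```python
# Minimal, hypothetical sketch of pattern-level label decoupling for VidSGG.
# NOT the authors' code: names, label maps, and the smoothing scheme are
# illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy predicate vocabulary: each predicate decomposes into an actional
# pattern and a spatial pattern (e.g. "sit_above" -> "sit" + "above").
PREDICATES = ["sit_above", "sit_beneath", "walk_toward", "walk_away"]
ACTIONS = ["sit", "walk"]
SPATIALS = ["above", "beneath", "toward", "away"]
PRED_TO_PATTERNS = {
    "sit_above": ("sit", "above"),
    "sit_beneath": ("sit", "beneath"),
    "walk_toward": ("walk", "toward"),
    "walk_away": ("walk", "away"),
}

class PatternDecoupledHead(nn.Module):
    """Two classifiers over decoupled patterns; their outputs are mapped
    back to predicate scores by summing the log-probabilities of the
    corresponding (action, spatial) pair."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.action_cls = nn.Linear(feat_dim, len(ACTIONS))
        self.spatial_cls = nn.Linear(feat_dim, len(SPATIALS))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        act_logp = F.log_softmax(self.action_cls(feats), dim=-1)
        spa_logp = F.log_softmax(self.spatial_cls(feats), dim=-1)
        pred_scores = []
        for p in PREDICATES:
            a, s = PRED_TO_PATTERNS[p]
            pred_scores.append(act_logp[:, ACTIONS.index(a)]
                               + spa_logp[:, SPATIALS.index(s)])
        return torch.stack(pred_scores, dim=-1)  # (batch, num_predicates)

def soft_pattern_targets(target_action: str, alpha: float = 0.1) -> torch.Tensor:
    """Toy stand-in for knowledge-level label decoupling: the target class
    keeps most of the probability mass, the rest is spread over the other
    classes that share the same pattern space."""
    t = torch.full((len(ACTIONS),), alpha / (len(ACTIONS) - 1))
    t[ACTIONS.index(target_action)] = 1.0 - alpha
    return t

if __name__ == "__main__":
    head = PatternDecoupledHead(feat_dim=256)
    scores = head(torch.randn(2, 256))
    print(scores.shape)          # torch.Size([2, 4])
    print(soft_pattern_targets("sit"))
```

Because the pattern vocabularies are far smaller and less skewed than the full predicate vocabulary, each head sees a much flatter label distribution, which is the intuition behind addressing the long-tailed problem at the pattern level.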
Related papers
- PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks [51.31903029903904]
In Scene Graph Generation (SGG), one extracts a structured representation from visual inputs in the form of object nodes and the predicates connecting them.
PRISM-0 is a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach.
PRISM-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval.
arXiv Detail & Related papers (2025-04-01T14:29:51Z) - Weakly Supervised Video Individual Counting [126.75545291243142]
Video Individual Counting aims to predict the number of unique individuals in a single video.
We introduce a weakly supervised VIC task, wherein trajectory labels are not provided.
To this end, we devise an end-to-end trainable soft contrastive loss that drives the network to distinguish inflow, outflow, and the remaining individuals.
arXiv Detail & Related papers (2023-12-10T16:12:13Z) - FloCoDe: Unbiased Dynamic Scene Graph Generation with Temporal Consistency and Correlation Debiasing [14.50214193838818]
FloCoDe combines flow-aware temporal consistency and correlation debiasing with uncertainty attenuation for unbiased dynamic scene graph generation.
We propose correlation debiasing and a correlation-based loss to learn unbiased relation representations for long-tailed classes.
arXiv Detail & Related papers (2023-10-24T14:59:51Z) - Triple Correlations-Guided Label Supplementation for Unbiased Video
Scene Graph Generation [27.844658260885744]
Video-based scene graph generation (VidSGG) is an approach that aims to represent video content in a dynamic graph by identifying visual entities and their relationships.
Current VidSGG methods have been found to perform poorly on less-represented predicates.
We propose an explicit solution that supplements missing predicates that should appear in the ground-truth annotations.
arXiv Detail & Related papers (2023-07-30T19:59:17Z) - LANDMARK: Language-guided Representation Enhancement Framework for Scene
Graph Generation [34.40862385518366]
Scene graph generation (SGG) is a sophisticated task that suffers from both complex visual features and dataset longtail problem.
We propose LANDMARK (LANguage-guiDed representation enhanceMent frAmewoRK) that learns predicate-relevant representations from language-vision interactive patterns.
This framework is model-agnostic and consistently improves performance on existing SGG models.
arXiv Detail & Related papers (2023-03-02T09:03:11Z) - Adaptive Fine-Grained Predicates Learning for Scene Graph Generation [122.4588401267544]
General Scene Graph Generation (SGG) models tend to predict head predicates, while re-balancing strategies prefer tail categories.
We propose an Adaptive Fine-Grained Predicates Learning (FGPL-A) which aims at differentiating hard-to-distinguish predicates for SGG.
Our proposed model-agnostic strategy significantly boosts performance of benchmark models on VG-SGG and GQA-SGG datasets by up to 175% and 76% on Mean Recall@100, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-07-11T03:37:57Z) - TCGL: Temporal Contrastive Graph for Self-supervised Video
Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z) - Revisiting Contrastive Methods for Unsupervised Learning of Visual
Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z) - Temporal Contrastive Graph Learning for Video Action Recognition and
Retrieval [83.56444443849679]
This work takes advantage of the temporal dependencies within videos and proposes a novel self-supervised method named Temporal Contrastive Graph Learning (TCGL).
Our TCGL is rooted in a hybrid graph contrastive learning strategy that jointly treats inter-snippet and intra-snippet temporal dependencies as self-supervision signals for temporal representation learning.
Experimental results demonstrate the superiority of our TCGL over the state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.
arXiv Detail & Related papers (2021-01-04T08:11:39Z) - Learning Graph-Based Priors for Generalized Zero-Shot Learning [21.43100823741393]
Zero-shot learning (ZSL) requires correctly predicting the label of samples from classes which were unseen at training time.
Recent approaches to generalized ZSL (GZSL) have shown the value of generative models, which are used to generate samples from unseen classes.
In this work, we incorporate an additional source of side information in the form of a relation graph over labels.
arXiv Detail & Related papers (2020-10-22T01:20:46Z)