Triple Correlations-Guided Label Supplementation for Unbiased Video
Scene Graph Generation
- URL: http://arxiv.org/abs/2307.16309v1
- Date: Sun, 30 Jul 2023 19:59:17 GMT
- Title: Triple Correlations-Guided Label Supplementation for Unbiased Video
Scene Graph Generation
- Authors: Wenqing Wang, Kaifeng Gao, Yawei Luo, Tao Jiang, Fei Gao, Jian Shao,
Jianwen Sun, Jun Xiao
- Abstract summary: Video-based scene graph generation (VidSGG) is an approach that aims to represent video content in a dynamic graph by identifying visual entities and their relationships.
Current VidSGG methods have been found to perform poorly on less-represented predicates.
We propose an explicit solution by supplementing missing predicates that should appear in the ground-truth annotations.
- Score: 27.844658260885744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based scene graph generation (VidSGG) is an approach that aims to
represent video content in a dynamic graph by identifying visual entities and
their relationships. Due to the inherently biased distribution and missing
annotations in the training data, current VidSGG methods have been found to
perform poorly on less-represented predicates. In this paper, we propose an
explicit solution to address this under-explored issue by supplementing missing
predicates that should appear in the ground-truth annotations. Dubbed Trico,
our method seeks to supplement the missing predicates by exploring three
complementary spatio-temporal correlations. Guided by these correlations, the
missing labels can be effectively supplemented, thus achieving unbiased
predicate prediction. We validate the effectiveness of Trico on the most
widely used VidSGG datasets, i.e., VidVRD and VidOR. Extensive experiments
demonstrate the state-of-the-art performance achieved by Trico, particularly on
those tail predicates.
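As a rough illustration of the general idea of supplementing likely-missing predicate labels, the sketch below switches on predicates that co-occur strongly with the annotated ones. The function name, the single co-occurrence statistic, and the threshold are illustrative assumptions; Trico's actual triple-correlation design is described in the paper.
```python
import numpy as np

def supplement_missing_predicates(gt_labels: np.ndarray,
                                  cooccurrence: np.ndarray,
                                  threshold: float = 0.7) -> np.ndarray:
    """Switch on predicates that are likely missing from the annotations.

    gt_labels:    (num_pairs, num_predicates) multi-hot ground truth.
    cooccurrence: (num_predicates, num_predicates) estimate of
                  P(predicate j also holds | predicate i annotated),
                  computed from the training annotations.
    """
    supplemented = gt_labels.copy()
    for row in supplemented:
        annotated = np.flatnonzero(row)
        if annotated.size == 0:
            continue
        # Score each predicate by its strongest correlation with any
        # annotated predicate; supplement it when the evidence is strong.
        scores = cooccurrence[annotated].max(axis=0)
        row[scores >= threshold] = 1.0
    return supplemented
```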
Related papers
- PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks [51.31903029903904]
In Scene Graph Generation (SGG), one extracts a structured representation from visual inputs in the form of object nodes and the predicates connecting them.
PRISM-0 is a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach.
PRISM-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval.
arXiv Detail & Related papers (2025-04-01T14:29:51Z)
DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation [61.59996525424585]
DIFFVSGG is an online VSGG solution that frames this task as an iterative scene graph update problem.
We unify the decoding of three tasks, namely object classification, bounding box regression, and graph generation, using one shared feature embedding (a rough sketch follows below).
DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs.
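A minimal sketch of what decoding three tasks from one shared feature embedding can look like; the head designs and the example sizes (VidVRD uses 35 object and 132 predicate categories) are assumptions, not DIFFVSGG's actual architecture, which additionally conditions on latent diffusion.
```python
import torch
import torch.nn as nn

class UnifiedDecoder(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 35,
                 num_predicates: int = 132):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes)          # object classification
        self.box_head = nn.Linear(dim, 4)                    # bounding box regression
        self.rel_head = nn.Linear(2 * dim, num_predicates)   # graph (predicate) decoding

    def forward(self, obj_emb: torch.Tensor):
        # obj_emb: (num_objects, dim), one shared embedding per entity.
        logits = self.cls_head(obj_emb)
        boxes = self.box_head(obj_emb).sigmoid()   # normalized box coordinates
        # Concatenate subject and object embeddings for every ordered pair.
        n = obj_emb.size(0)
        pairs = torch.cat([obj_emb.unsqueeze(1).expand(n, n, -1),
                           obj_emb.unsqueeze(0).expand(n, n, -1)], dim=-1)
        rel_logits = self.rel_head(pairs)
        return logits, boxes, rel_logits
```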
arXiv Detail & Related papers (2025-03-18T06:49:51Z)
Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing [9.352570324002505]
Video Scene Graph Generation (VidSGG) aims to capture dynamic relationships among entities by sequentially analyzing video frames and integrating visual and semantic information.
We propose a VIsual and Semantic Awareness (VISA) framework for unbiased VidSGG.
arXiv Detail & Related papers (2025-03-01T16:31:02Z)
Ensemble Predicate Decoding for Unbiased Scene Graph Generation [40.01591739856469]
Scene Graph Generation (SGG) aims to generate a comprehensive graphical representation that captures semantic information of a given scenario.
The model's performance in predicting more fine-grained predicates is hindered by a significant predicate bias.
This paper proposes Ensemble Predicate Decoding (EPD), which employs multiple decoders to attain unbiased scene graph generation.
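A minimal sketch of the ensembling pattern the summary names: several predicate decoders whose logits are averaged. How EPD specializes its member decoders (its actual contribution) is not reproduced here, and all sizes are illustrative.
```python
import torch
import torch.nn as nn

class EnsemblePredicateDecoder(nn.Module):
    def __init__(self, dim: int = 256, num_predicates: int = 50,
                 num_decoders: int = 3):
        super().__init__()
        self.decoders = nn.ModuleList(
            nn.Linear(dim, num_predicates) for _ in range(num_decoders))

    def forward(self, pair_feat: torch.Tensor) -> torch.Tensor:
        # Average member logits so that decoders biased toward different
        # parts of the predicate distribution complement one another.
        return torch.stack([d(pair_feat) for d in self.decoders]).mean(dim=0)
```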
arXiv Detail & Related papers (2024-08-26T11:24:13Z)
Few-shot Knowledge Graph Relational Reasoning via Subgraph Adaptation [51.47994645529258]
Few-shot Knowledge Graph (KG) Reasoning aims to predict unseen triplets (i.e., query triplets) for rare relations in KGs.
We propose SAFER (Subgraph Adaptation for Few-shot Reasoning), a novel approach that effectively adapts the information in contextualized graphs to various subgraphs.
arXiv Detail & Related papers (2024-06-19T21:40:35Z)
Leveraging Predicate and Triplet Learning for Scene Graph Generation [31.09787444957997]
Scene Graph Generation (SGG) aims to identify entities and predict the relationship triplets.
We propose a Dual-granularity Relation Modeling (DRM) network to leverage fine-grained triplet cues besides the coarse-grained predicate ones.
Our method establishes new state-of-the-art performance on Visual Genome, Open Image, and GQA datasets.
arXiv Detail & Related papers (2024-06-04T07:23:41Z)
FloCoDe: Unbiased Dynamic Scene Graph Generation with Temporal Consistency and Correlation Debiasing [14.50214193838818]
FloCoDe: Flow-aware Temporal and Correlation Debiasing with uncertainty attenuation for unbiased dynamic scene graphs.
We propose correlation debiasing and a correlation-based loss to learn unbiased relation representations for long-tailed classes.
arXiv Detail & Related papers (2023-10-24T14:59:51Z)
LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation [27.97296273461145]
Weakly-Supervised Scene Graph Generation (WSSGG) research has recently emerged as an alternative to the fully-supervised approach.
We propose a new approach, i.e., Large Language Model for weakly-supervised SGG (LLM4SGG).
We show significant improvements in both Recall@K and mean Recall@K compared to the state-of-the-art WSSGG methods.
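The general recipe behind using an LLM for weakly supervised SGG is to extract subject-predicate-object triplets from captions. The prompt and parser below are hypothetical stand-ins (LLM4SGG's actual prompting and vocabulary alignment differ), and `llm` is any text-completion callable.
```python
import re

def triplets_from_caption(caption: str, llm) -> list[tuple[str, str, str]]:
    """Ask an LLM for (subject, predicate, object) triplets and parse them."""
    prompt = ("Extract (subject, predicate, object) triplets from the "
              f"following caption, one per line:\n{caption}")
    response = llm(prompt)  # e.g., a wrapper around any chat/completion API
    triplets = []
    for line in response.splitlines():
        match = re.match(r"\((.+?),\s*(.+?),\s*(.+?)\)", line.strip())
        if match:
            triplets.append(tuple(part.strip() for part in match.groups()))
    return triplets
```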
arXiv Detail & Related papers (2023-10-16T13:49:46Z)
Multi-Label Meta Weighting for Long-Tailed Dynamic Scene Graph Generation [55.429541407920304]
Recognizing the predicate between subject-object pairs is imbalanced and multi-label in nature.
Recent state-of-the-art methods predominantly focus on the most frequently occurring predicate classes.
We introduce a multi-label meta-learning framework to deal with the biased predicate distribution.
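The meta-learning machinery itself does not fit in a short sketch, so the snippet below shows a simpler, commonly used substitute for the same biased multi-label setting: class-balanced reweighting via the effective number of samples. All names are illustrative; the paper's framework learns the weights on a meta set instead.
```python
import torch
import torch.nn.functional as F

def class_balanced_multilabel_loss(logits: torch.Tensor,
                                   targets: torch.Tensor,
                                   class_counts: torch.Tensor,
                                   beta: float = 0.999) -> torch.Tensor:
    """logits, targets: (batch, num_predicates), targets multi-hot;
    class_counts: (num_predicates,) training frequency of each predicate."""
    # Effective number of samples per class: (1 - beta^n) / (1 - beta).
    effective_num = 1.0 - torch.pow(beta, class_counts.float().clamp(min=1))
    weights = (1.0 - beta) / effective_num
    weights = weights / weights.sum() * weights.numel()  # mean weight of 1
    # Rare (tail) predicates receive larger weights in the multi-label loss.
    return F.binary_cross_entropy_with_logits(logits, targets, weight=weights)
```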
arXiv Detail & Related papers (2023-06-16T18:14:23Z)
Taking A Closer Look at Visual Relation: Unbiased Video Scene Graph Generation with Decoupled Label Learning [43.68357108342476]
We take a closer look at the predicates and identify that most visual relations involve both an actional pattern (e.g., sit) and a spatial pattern.
We propose a decoupled label learning (DLL) paradigm to address the intractable visual relation prediction from the pattern-level perspective.
arXiv Detail & Related papers (2023-03-23T12:08:10Z)
Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity dilemma in video annotations, where the sparse annotations fail to provide context between potential events and query sentences in the dataset.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and query sentences.
arXiv Detail & Related papers (2023-01-15T02:04:02Z)
Entailment Graph Learning with Textual Entailment and Soft Transitivity [69.91691115264132]
We propose a two-stage method, Entailment Graph with Textual Entailment and Transitivity (EGT2).
EGT2 learns local entailment relations by recognizing possible textual entailment between template sentences formed by CCG-parsed predicates.
Based on the generated local graph, EGT2 then uses three novel soft transitivity constraints to consider the logical transitivity in entailment structures.
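As a toy illustration of a soft transitivity constraint on an entailment-score matrix, the hinge-style penalty below discourages i => j and j => k holding strongly while i => k does not. EGT2's three constraints take different forms, so treat this only as a sketch of the idea.
```python
import torch

def soft_transitivity_penalty(scores: torch.Tensor) -> torch.Tensor:
    """scores: (n, n) entailment scores in [0, 1]; scores[i, j] = i entails j."""
    n = scores.size(0)
    # via[i, j, k] = min(score(i=>j), score(j=>k)): strength of the chained path.
    via = torch.minimum(scores.unsqueeze(2), scores.unsqueeze(0))
    # direct[i, j, k] = score(i=>k): the edge transitivity says should exist.
    direct = scores.unsqueeze(1).expand(n, n, n)
    return torch.relu(via - direct).mean()
```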
arXiv Detail & Related papers (2022-04-07T08:33:06Z)
Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation [62.96628432641806]
Scene Graph Generation aims to first encode the visual contents within the given image and then parse them into a compact summary graph.
We first present a novel Stacked Hybrid-Attention network, which facilitates the intra-modal refinement as well as the inter-modal interaction.
We then devise an innovative Group Collaborative Learning strategy to optimize the decoder.
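A minimal sketch of one layer stacking intra-modal self-attention with inter-modal cross-attention, the pattern named by "Stacked Hybrid-Attention"; the dimensions, residual composition, and omitted normalization layers are assumptions rather than the paper's exact design.
```python
import torch.nn as nn

class HybridAttentionLayer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual, textual):
        # Intra-modal refinement: visual tokens attend to one another.
        visual = visual + self.self_attn(visual, visual, visual)[0]
        # Inter-modal interaction: refined visual tokens attend to text tokens.
        visual = visual + self.cross_attn(visual, textual, textual)[0]
        return visual
```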
arXiv Detail & Related papers (2022-03-18T09:14:13Z)
PCPL: Predicate-Correlation Perception Learning for Unbiased Scene Graph Generation [58.98802062945709]
We propose a novel Predicate-Correlation Perception Learning scheme to adaptively seek out appropriate loss weights.
Our PCPL framework is further equipped with a graph encoder module to better extract context features.
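One plausible reading of correlation-driven loss weighting, sketched below: predicates strongly correlated with many others carry less independent information and are down-weighted. The weighting rule is an illustrative assumption, not PCPL's exact formulation.
```python
import torch
import torch.nn.functional as F

def correlation_aware_loss(logits: torch.Tensor,
                           target: torch.Tensor,
                           corr: torch.Tensor) -> torch.Tensor:
    """logits: (batch, C); target: (batch,) predicate indices;
    corr: (C, C) predicate-predicate correlation, e.g. from co-occurrence."""
    off_diag = corr - torch.diag(torch.diag(corr))    # ignore self-correlation
    weights = 1.0 / (1.0 + off_diag.sum(dim=1))       # more correlated => smaller weight
    weights = weights / weights.mean()                # keep the average weight at 1
    return F.cross_entropy(logits, target, weight=weights)
```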
arXiv Detail & Related papers (2020-09-02T08:30:09Z)