Related papers: SceneProp: Combining Neural Network and Markov Random Field for Scene-Graph Grounding

URL: http://arxiv.org/abs/2512.00936v1
Date: Sun, 30 Nov 2025 15:35:38 GMT
Title: SceneProp: Combining Neural Network and Markov Random Field for Scene-Graph Grounding
Authors: Keita Otani, Tatsuya Harada
Abstract summary: Grounding complex visual queries with multiple objects and relationships is a fundamental challenge for vision-language models. Standard phrase grounding methods excel at localizing single objects, but lack the structural inductive bias to parse intricate relational descriptions. We introduce SceneProp, a novel method that resolves this issue by reformulating scene-graph grounding as a Maximum a Posteriori (MAP) inference problem in a Markov Random Field (MRF).
Score: 44.72928381789337
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Grounding complex, compositional visual queries with multiple objects and relationships is a fundamental challenge for vision-language models. While standard phrase grounding methods excel at localizing single objects, they lack the structural inductive bias to parse intricate relational descriptions, often failing as queries become more descriptive. To address this structural deficit, we focus on scene-graph grounding, a powerful but less-explored formulation where the query is an explicit graph of objects and their relationships. However, existing methods for this task also struggle, paradoxically showing decreased performance as the query graph grows -- failing to leverage the very information that should make grounding easier. We introduce SceneProp, a novel method that resolves this issue by reformulating scene-graph grounding as a Maximum a Posteriori (MAP) inference problem in a Markov Random Field (MRF). By performing global inference over the entire query graph, SceneProp finds the optimal assignment of image regions to nodes that jointly satisfies all constraints. This is achieved within an end-to-end framework via a differentiable implementation of the Belief Propagation algorithm. Experiments on four benchmarks show that our dedicated focus on the scene-graph grounding formulation allows SceneProp to significantly outperform prior work. Critically, its accuracy consistently improves with the size and complexity of the query graph, demonstrating for the first time that more relational context can, and should, lead to better grounding. Codes are available at https://github.com/keitaotani/SceneProp.
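The abstract's core idea, assigning image regions to query-graph nodes by MAP inference in an MRF with Belief Propagation, can be illustrated with a small sketch. This is not the authors' implementation: it assumes the max-sum (log-domain max-product) variant of Belief Propagation, uses plain NumPy rather than a differentiable framework, and the function name `max_sum_bp`, the two-node query, and all scores are made up for illustration. In SceneProp the unary and pairwise scores would come from a neural network.

```python
import numpy as np

def max_sum_bp(unary, edges, pairwise, n_iters=10):
    """Loopy max-sum belief propagation (illustrative, not the paper's code).

    unary:    dict node -> (R,) array, log-score of each candidate region
    edges:    list of (i, j) query-graph edges
    pairwise: dict (i, j) -> (R, R) array; entry [a, b] is the log-score of
              node i taking region a while node j takes region b
    Returns a dict node -> index of the selected region.
    """
    nodes = list(unary)
    num_regions = len(next(iter(unary.values())))
    msgs = {}
    nbrs = {n: [] for n in nodes}
    for i, j in edges:
        # One message per direction, initialised to zero (uninformative).
        msgs[(i, j)] = np.zeros(num_regions)
        msgs[(j, i)] = np.zeros(num_regions)
        nbrs[i].append(j)
        nbrs[j].append(i)

    def pot(i, j):
        # Pairwise table oriented as [region of i, region of j].
        return pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T

    for _ in range(n_iters):
        new_msgs = {}
        for i, j in msgs:
            # Belief at i from everything except the message coming from j.
            b = unary[i] + sum(msgs[(k, i)] for k in nbrs[i] if k != j)
            m = np.max(b[:, None] + pot(i, j), axis=0)
            new_msgs[(i, j)] = m - m.max()  # shift for numerical stability
        msgs = new_msgs

    assignment = {}
    for n in nodes:
        belief = unary[n] + sum(msgs[(k, n)] for k in nbrs[n])
        assignment[n] = int(np.argmax(belief))
    return assignment

# Hypothetical query "person riding horse" over three candidate regions.
# Regions 0 and 1 both look like a person; only region 1 is riding region 2.
unary = {
    "person": np.array([2.0, 1.9, -3.0]),
    "horse": np.array([-3.0, -3.0, 2.0]),
}
riding = np.zeros((3, 3))
riding[1, 2] = 3.0
pairwise = {("person", "horse"): riding}

unary_only = {n: int(np.argmax(s)) for n, s in unary.items()}
assignment = max_sum_bp(unary, [("person", "horse")], pairwise)
print(unary_only)   # per-node choice, ignoring the relation
print(assignment)   # joint choice, using the relation
```

In this toy case the per-node scores alone pick region 0 for "person", while the joint inference switches to region 1 because the "riding" relation supports it. That mirrors the abstract's claim that relational context should help grounding. On tree-shaped query graphs with a unique optimum, max-sum recovers the exact MAP assignment; on graphs with cycles it is only approximate.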

Related papers

GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives [69.36723767339001]
We propose a novel framework named GPT4SGG to obtain more accurate and comprehensive scene graph signals. We show GPT4SGG significantly improves the performance of SGG models trained on image-caption data.
arXiv Detail & Related papers (2023-12-07T14:11:00Z)
Fine-Grained is Too Coarse: A Novel Data-Centric Approach for Efficient Scene Graph Generation [0.7851536646859476]
We introduce the task of Efficient Scene Graph Generation (SGG) that prioritizes the generation of relevant relations. We present a new dataset, VG150-curated, based on the annotations of the popular Visual Genome dataset. We show through a set of experiments that this dataset contains more high-quality and diverse annotations than the one usually used in SGG.
arXiv Detail & Related papers (2023-05-30T00:55:49Z)
Location-Free Scene Graph Generation [45.366540803729386]
Scene Graph Generation (SGG) is a visual understanding task, aiming to describe a scene as a graph of entities and their relationships with each other. Existing works rely on location labels in the form of bounding boxes or segmentation masks, increasing annotation costs and limiting dataset expansion. We break this dependency and introduce location-free scene graph generation (LF-SGG). This new task aims at predicting instances of entities, as well as their relationships, without the explicit calculation of their spatial localization.
arXiv Detail & Related papers (2023-03-20T08:57:45Z)
Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing [17.63475613154152]
This paper presents a framework for jointly grounding objects that follow certain semantic relationship constraints in a scene graph. A scene graph is an efficient and structured way to represent all the objects and their semantic relationships in the image.
arXiv Detail & Related papers (2022-11-03T16:46:46Z)
Scene Graph Modification as Incremental Structure Expanding [61.84291817776118]
We focus on scene graph modification (SGM), where the system is required to learn how to update an existing scene graph based on a natural language query. We frame SGM as a graph expansion task by introducing incremental structure expanding (ISE). We construct a challenging dataset that contains more complicated queries and larger scene graphs than existing datasets.
arXiv Detail & Related papers (2022-09-15T16:26:14Z)
Iterative Scene Graph Generation [55.893695946885174]
Scene graph generation involves identifying object entities and their corresponding interaction predicates in a given image (or video). Existing approaches to scene graph generation assume certain factorization of the joint distribution to make the estimation iteration feasible. We propose a novel framework that addresses this limitation, as well as introduces dynamic conditioning on the image.
arXiv Detail & Related papers (2022-07-27T10:37:29Z)
Relation-aware Instance Refinement for Weakly Supervised Visual Grounding [44.33411132188231]
Visual grounding aims to build a correspondence between visual objects and their language entities. We propose a novel weakly-supervised learning method that incorporates coarse-to-fine object refinement and entity relation modeling. Experiments on two public benchmarks demonstrate the efficacy of our framework.
arXiv Detail & Related papers (2021-03-24T05:03:54Z)
Dual ResGCN for Balanced Scene Graph Generation [106.7828712878278]
We propose a novel model, dubbed dual ResGCN, which consists of an object residual graph convolutional network and a relation residual graph convolutional network. The two networks are complementary to each other. The former captures object-level context information, i.e., the connections among objects. The latter is carefully designed to explicitly capture relation-level context information, i.e., the connections among relations.
arXiv Detail & Related papers (2020-11-09T07:44:17Z)
Generative Compositional Augmentations for Scene Graph Prediction [27.535630110794855]
Inferring objects and their relationships from an image in the form of a scene graph is useful in many applications at the intersection of vision and language. We consider a challenging problem of compositional generalization that emerges in this task due to a long tail data distribution. We propose and empirically study a model based on conditional generative adversarial networks (GANs) that allows us to generate visual features of perturbed scene graphs.
arXiv Detail & Related papers (2020-07-11T12:11:53Z)
Iterative Context-Aware Graph Inference for Visual Dialog [126.016187323249]
We propose a novel Context-Aware Graph (CAG) neural network. Each node in the graph corresponds to a joint semantic feature, including both object-based (visual) and history-related (textual) context representations.
arXiv Detail & Related papers (2020-04-05T13:09:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.