Related papers: Synthetic Visual Genome

URL: http://arxiv.org/abs/2506.07643v1
Date: Mon, 09 Jun 2025 11:09:10 GMT
Title: Synthetic Visual Genome
Authors: Jae Sung Park, Zixian Ma, Linjie Li, Chenhao Zheng, Cheng-Yu Hsieh, Ximing Lu, Khyathi Chandu, Quan Kong, Norimasa Kobori, Ali Farhadi, Yejin Choi, Ranjay Krishna
Abstract summary: We introduce ROBIN: an MLM instruction-tuned with densely annotated relationships capable of constructing high-quality dense scene graphs at scale. In total, our dataset contains 146K images and 5.6M relationships for 2.6M objects. Our ROBIN-3B model, despite being trained on less than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks.
Score: 88.00433979509218
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reasoning over visual relationships (spatial, functional, interactional, social, etc.) is considered to be a fundamental component of human cognition. Yet, despite the major advances in visual comprehension in multimodal language models (MLMs), precise reasoning over relationships and their generations remains a challenge. We introduce ROBIN: an MLM instruction-tuned with densely annotated relationships capable of constructing high-quality dense scene graphs at scale. To train ROBIN, we curate SVG, a synthetic scene graph dataset by completing the missing relations of selected objects in existing scene graphs using a teacher MLM and a carefully designed filtering process to ensure high quality. To generate more accurate and rich scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework where GPT-4o further refines ROBIN's predicted scene graphs by removing unlikely relations and/or suggesting relevant ones. In total, our dataset contains 146K images and 5.6M relationships for 2.6M objects. Results show that our ROBIN-3B model, despite being trained on less than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks, and even surpasses larger models up to 13B parameters. Notably, it achieves state-of-the-art performance in referring expression comprehension with a score of 88.9, surpassing the previous best of 87.4. Our results suggest that training on the refined scene graph data is crucial to maintaining high performance across diverse visual reasoning tasks.
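
As a rough illustration of the refinement step described in the abstract (removing unlikely relations and adding suggested ones), the following is a minimal Python sketch, not the authors' implementation. The names `Relation`, `SceneGraph`, and `apply_edits` are hypothetical, and the edit lists are written by hand here; in the paper they would come from a refiner model such as GPT-4o.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One (subject, predicate, object) triple. Hypothetical type for illustration."""
    subject: str
    predicate: str
    object: str


@dataclass
class SceneGraph:
    """Objects are nodes; relations are directed, labeled edges."""
    objects: set[str] = field(default_factory=set)
    relations: set[Relation] = field(default_factory=set)


def apply_edits(graph: SceneGraph, remove: list[Relation], add: list[Relation]) -> SceneGraph:
    """Return a new graph with `remove` dropped and `add` included.

    Assumption: suggested relations that mention unknown objects are discarded.
    """
    edited = (graph.relations - set(remove)) | set(add)
    valid = {
        r for r in edited
        if r.subject in graph.objects and r.object in graph.objects
    }
    return SceneGraph(objects=set(graph.objects), relations=valid)


# Hand-written example: one implausible relation is removed, one missing relation is added.
graph = SceneGraph(
    objects={"person", "bicycle", "helmet"},
    relations={
        Relation("person", "riding", "bicycle"),
        Relation("helmet", "riding", "bicycle"),
    },
)
refined = apply_edits(
    graph,
    remove=[Relation("helmet", "riding", "bicycle")],
    add=[Relation("person", "wearing", "helmet")],
)
```

Returning a new graph rather than mutating the input keeps the model's original prediction available for comparison with the refined version.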

Related papers

SGR3 Model: Scene Graph Retrieval-Reasoning Model in 3D [51.32219731589742]
3D scene graphs provide a structured representation of object entities and their relationships. Existing approaches for 3D scene graph generation typically combine scene reconstruction with graph neural networks (GNNs). In this work, we introduce a Scene Graph Retrieval-Reasoning Model in 3D (SGR3 Model).
arXiv Detail & Related papers (2026-03-04T21:19:54Z)
With Great Context Comes Great Prediction Power: Classifying Objects via Geo-Semantic Scene Graphs [5.492064811668243]
This paper argues for the critical role of context and introduces a novel framework for contextual object classification. We first construct a Geo-Semantic Contextual Graph (GSCG) from a single monocular image. This explicit graph structure makes the model's reasoning process inherently interpretable.
arXiv Detail & Related papers (2025-12-28T17:53:55Z)
MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning [44.61781303455069]
Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. We introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements.
arXiv Detail & Related papers (2025-12-18T18:59:03Z)
Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next-token prediction is the fundamental principle for training large language models (LLMs). We introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset. We design a set of graph-centric rewards, including three recall-based variants: Hard Recall, Hard Recall+Relax, and Soft Recall.
arXiv Detail & Related papers (2025-04-18T10:46:22Z)
DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation [13.058196732927135]
Scene graph generation aims to capture detailed spatial and semantic relationships between objects in an image. Existing Transformer-based methods either employ distinct queries for objects and predicates or utilize holistic queries for relation triplets. We present a new Transformer-based method, called DSGG, that views scene graph detection as a direct graph prediction problem.
arXiv Detail & Related papers (2024-03-21T23:43:30Z)
Semantic Compositional Learning for Low-shot Scene Graph Generation [122.51930904132685]
Many scene graph generation (SGG) models solely use the limited annotated relation triples for training. We propose a novel semantic compositional learning strategy that makes it possible to construct additional, realistic relation triples. For three recent SGG models, adding our strategy improves their performance by close to 50%, and all of them substantially exceed the current state-of-the-art.
arXiv Detail & Related papers (2021-08-19T10:13:55Z)
Mutual Graph Learning for Camouflaged Object Detection [31.422775969808434]
A major challenge is that intrinsic similarities between foreground objects and background surroundings make the features extracted by deep models indistinguishable. We design a novel Mutual Graph Learning model, which generalizes the idea of conventional mutual learning from regular grids to the graph domain. In contrast to most mutual learning approaches that use a shared function to model all between-task interactions, MGL is equipped with typed functions for handling different complementary relations.
arXiv Detail & Related papers (2021-04-03T10:14:39Z)
Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models relational-temporal relations. We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
Visual Distant Supervision for Scene Graph Generation [66.10579690929623]
Scene graph models usually require supervised learning on large quantities of labeled data with intensive human annotation. We propose visual distant supervision, a novel paradigm of visual relation learning, which can train scene graph models without any human-labeled data. Comprehensive experimental results show that our distantly supervised model outperforms strong weakly supervised and semi-supervised baselines.
arXiv Detail & Related papers (2021-03-29T06:35:24Z)
Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions. We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision. Experiments show that by taking advantage of the relationships we are able to improve over state-of-the-art.
arXiv Detail & Related papers (2020-10-19T08:25:55Z)
Scene Graph Generation via Conditional Random Fields [14.282277071380447]
We propose a novel scene graph generation model for predicting object instances and their corresponding relationships in an image. Our model, SG-CRF, efficiently learns the sequential order of subject and object in a relationship triplet, and the semantic compatibility of object instance nodes and relationship nodes in a scene graph.
arXiv Detail & Related papers (2018-11-20T04:55:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.