Single-Stage Visual Relationship Learning using Conditional Queries
- URL: http://arxiv.org/abs/2306.05689v1
- Date: Fri, 9 Jun 2023 06:02:01 GMT
- Title: Single-Stage Visual Relationship Learning using Conditional Queries
- Authors: Alakh Desai, Tz-Ying Wu, Subarna Tripathi, Nuno Vasconcelos
- Abstract summary: TraCQ is a new formulation for scene graph generation that avoids the multi-task learning problem and the combinatorial entity pair distribution.
We employ a DETR-based encoder-decoder design and leverage conditional queries to significantly reduce the entity label space.
Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods but also beats many state-of-the-art two-stage methods on the Visual Genome dataset.
- Score: 60.90880759475021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research in scene graph generation (SGG) usually considers two-stage models,
that is, detecting a set of entities, followed by combining them and labeling
all possible relationships. While such pipelines show promising results, their
structure induces large parameter and computation overhead and typically
hinders end-to-end optimization. To address this, recent research attempts to
train single-stage models that are computationally efficient. With the advent
of DETR, a set-based detection model, one-stage models attempt to predict a set
of subject-predicate-object triplets directly in a single shot. However, SGG is
inherently a multi-task learning problem that requires modeling entity and
predicate distributions simultaneously. In this paper, we propose Transformers
with conditional queries for SGG, namely TraCQ, a new formulation for SGG
that avoids the multi-task learning problem and the combinatorial entity pair
distribution. We employ a DETR-based encoder-decoder design and leverage
conditional queries to significantly reduce the entity label space, which
leads to 20% fewer parameters compared to state-of-the-art single-stage
models. Experimental results show that TraCQ not only outperforms existing
single-stage scene graph generation methods but also beats many
state-of-the-art two-stage methods on the Visual Genome dataset, while
remaining end-to-end trainable and faster at inference.
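As a rough illustration of the conditional-query idea, the sketch below decodes predicate-centric query features first and then conditions entity-box decoding on them, so no separate combinatorial entity label space is classified. All module names, query counts, and dimensions are hypothetical; this is not the paper's exact architecture.

```python
# Hypothetical sketch in the spirit of TraCQ's conditional queries;
# layer sizes, query counts, and head designs are illustrative only.
import torch
import torch.nn as nn

class ConditionalTripletDecoder(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_predicates=50):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)  # learned triplet queries
        self.predicate_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        self.entity_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.predicate_head = nn.Linear(d_model, num_predicates)
        self.subj_box_head = nn.Linear(d_model, 4)  # (cx, cy, w, h), normalized
        self.obj_box_head = nn.Linear(d_model, 4)

    def forward(self, image_features):
        # image_features: (batch, num_tokens, d_model) from an upstream encoder
        b = image_features.size(0)
        q = self.queries.weight.unsqueeze(0).repeat(b, 1, 1)
        # Stage 1: decode a predicate-centric representation per query.
        pred_feats = self.predicate_decoder(q, image_features)
        # Stage 2: conditional queries -- entity decoding is conditioned on the
        # predicate features instead of classifying a combinatorial entity space.
        ent_feats = self.entity_decoder(pred_feats, image_features)
        return {
            "predicate_logits": self.predicate_head(pred_feats),
            "subject_boxes": self.subj_box_head(ent_feats).sigmoid(),
            "object_boxes": self.obj_box_head(ent_feats).sigmoid(),
        }
```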
Related papers
- Few-shot Prompting for Pairwise Ranking: An Effective Non-Parametric Retrieval Model [18.111868378615206]
We propose a pairwise few-shot ranker that achieves a close performance to that of a supervised model without requiring any complex training pipeline.
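As a hedged illustration of the pairwise idea (not the paper's exact procedure), a ranking can be recovered by aggregating pairwise preferences; the `prefers` oracle below is a hypothetical placeholder for a few-shot-prompted LLM judge.

```python
# Minimal sketch of pairwise ranking by win counting; `prefers` stands in
# for a few-shot-prompted LLM judge and is a hypothetical placeholder.
from itertools import combinations

def rank_by_pairwise_wins(query, docs, prefers):
    # prefers(query, a, b) -> True if document `a` is judged more relevant than `b`
    wins = {d: 0 for d in docs}
    for a, b in combinations(docs, 2):
        winner = a if prefers(query, a, b) else b
        wins[winner] += 1
    return sorted(docs, key=lambda d: wins[d], reverse=True)
```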
arXiv Detail & Related papers (2024-09-26T11:19:09Z)
- Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation [57.69385990442078]
Hydra-SGG achieves state-of-the-art performance with 10.6 mR@20 and 16.0 mR@50 on VG150, while only requiring 12 training epochs.
It also sets a new state-of-the-art on Open Images V6 and GQA.
arXiv Detail & Related papers (2024-09-16T13:13:06Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes such as pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S^2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z)
- RelTR: Relation Transformer for Scene Graph Generation [34.1193503312965]
We propose an end-to-end scene graph generation model RelTR with an encoder-decoder architecture.
The model infers a fixed-size set of subject-predicate-object triplets using different types of attention mechanisms.
Experiments on the Visual Genome and Open Images V6 datasets demonstrate the superior performance and fast inference of our model.
arXiv Detail & Related papers (2022-01-27T11:53:41Z)
- Query Training: Learning a Worse Model to Infer Better Marginals in Undirected Graphical Models with Hidden Variables [11.985433487639403]
Probabilistic graphical models (PGMs) provide a compact representation of knowledge that can be queried in a flexible way.
We introduce query training (QT), a mechanism to learn a PGM that is optimized for the approximate inference algorithm that will be paired with it.
We demonstrate experimentally that QT can be used to learn a challenging 8-connected grid Markov random field with hidden variables.
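As a loose sketch of this idea, the snippet below unrolls a few differentiable mean-field updates (an illustrative stand-in for the paired approximate inference algorithm), so the potentials can be trained directly on the quality of the inferred marginals; the details are assumptions, not the paper's algorithm.

```python
# Hedged sketch of query training: unroll differentiable approximate
# inference (mean-field here, an assumed choice) and backpropagate a
# query loss through the inferred marginals to the model potentials.
import torch

def mean_field_marginals(unaries, pairwise, steps=10):
    # unaries: (n, k) log-potentials; pairwise: (n, n, k, k) log-potentials
    q = torch.softmax(unaries, dim=-1)
    for _ in range(steps):
        # influence on node i, state k: sum_j sum_l q[j, l] * pairwise[i, j, k, l]
        msg = torch.einsum('jl,ijkl->ik', q, pairwise)
        q = torch.softmax(unaries + msg, dim=-1)
    return q  # differentiable, so a loss on queried marginals trains the potentials
```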
arXiv Detail & Related papers (2020-06-11T20:34:32Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
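To make the set-prediction ingredient concrete, below is a small sketch of bipartite matching between predictions and ground truth with the Hungarian algorithm; for brevity it uses a classification-only cost, whereas DETR combines class and box costs.

```python
# Sketch of DETR-style bipartite matching via the Hungarian algorithm.
# DETR's actual cost also includes box terms; this uses class cost only.
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_logits, target_classes):
    # pred_logits: (num_queries, num_classes); target_classes: (num_targets,)
    probs = pred_logits.softmax(dim=-1)
    cost = -probs[:, target_classes]  # (num_queries, num_targets)
    rows, cols = linear_sum_assignment(cost.detach().numpy())
    return rows, cols  # prediction rows[t] is matched to target cols[t]
```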
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.