Visual Composite Set Detection Using Part-and-Sum Transformers
- URL: http://arxiv.org/abs/2105.02170v1
- Date: Wed, 5 May 2021 16:31:32 GMT
- Title: Visual Composite Set Detection Using Part-and-Sum Transformers
- Authors: Qi Dong, Zhuowen Tu, Haofu Liao, Yuting Zhang, Vijay Mahadevan,
Stefano Soatto
- Abstract summary: We present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end composite set detection.
PST achieves state-of-the-art results among single-stage models, while nearly matching the results of custom-designed two-stage models.
- Score: 74.26037922682355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computer vision applications such as visual relationship detection and
human-object interaction can be formulated as a composite (structured) set
detection problem in which both the parts (subject, object, and predicate) and
the sum (triplet as a whole) are to be detected in a hierarchical fashion. In
this paper, we present a new approach, denoted Part-and-Sum detection
Transformer (PST), to perform end-to-end composite set detection. Different
from existing Transformers in which queries are at a single level, we
simultaneously model the joint part and sum hypotheses/interactions with
composite queries and attention modules. We explicitly incorporate sum queries
to enable better modeling of the part-and-sum relations that are absent in the
standard Transformers. Our approach also uses novel tensor-based part queries
and vector-based sum queries, and models their joint interaction. We report
experiments on two vision tasks, visual relationship detection and
human-object interaction, and demonstrate that PST achieves state-of-the-art
results among single-stage models, while nearly matching the results of
custom-designed two-stage models.
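The core architectural idea is that each candidate triplet is represented by composite queries: a tensor of part queries (one per subject, object, and predicate slot) plus a vector-valued sum query for the triplet as a whole, decoded jointly so that part and sum hypotheses can interact through attention. The sketch below is a minimal illustration of how such composite queries could be wired into a standard transformer decoder; the module name, shapes, and hyperparameters are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of composite (part + sum) queries with a generic PyTorch
# transformer decoder. All names, shapes, and hyperparameters are assumed
# for illustration; this is not the PST reference implementation.
import torch
import torch.nn as nn

class CompositeQueryDecoder(nn.Module):
    def __init__(self, num_triplets=100, num_parts=3, d_model=256,
                 nhead=8, num_layers=6):
        super().__init__()
        # Tensor-based part queries: one query per (triplet, part) pair,
        # e.g. parts = (subject, object, predicate).
        self.part_queries = nn.Parameter(torch.randn(num_triplets, num_parts, d_model))
        # Vector-based sum queries: one query per triplet as a whole.
        self.sum_queries = nn.Parameter(torch.randn(num_triplets, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.num_triplets, self.num_parts = num_triplets, num_parts

    def forward(self, memory):
        # memory: encoded image features, shape (batch, num_tokens, d_model)
        b = memory.size(0)
        parts = self.part_queries.flatten(0, 1)        # (T*P, d)
        queries = torch.cat([parts, self.sum_queries], dim=0)  # (T*P + T, d)
        queries = queries.unsqueeze(0).expand(b, -1, -1)
        # Self-attention inside the decoder lets part and sum queries
        # exchange information before cross-attending to image features.
        out = self.decoder(queries, memory)
        part_out, sum_out = out.split(
            [self.num_triplets * self.num_parts, self.num_triplets], dim=1)
        part_out = part_out.reshape(b, self.num_triplets, self.num_parts, -1)
        return part_out, sum_out
```

Under these assumptions, calling `part_out, sum_out = CompositeQueryDecoder()(features)` on encoder features of shape (batch, tokens, 256) would yield per-part embeddings of shape (batch, 100, 3, 256) and per-triplet sum embeddings of shape (batch, 100, 256), which separate heads could then map to boxes, entity labels, and predicate labels.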
Related papers
- Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection [14.22646492640906]
We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection.
Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly.
Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds.
arXiv Detail & Related papers (2024-03-21T10:15:57Z)
- Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
- Single-Stage Visual Relationship Learning using Conditional Queries [60.90880759475021]
TraCQ is a new formulation for scene graph generation that avoids the multi-task learning problem and dependence on the entity pair distribution.
We employ a DETR-based encoder-decoder design with conditional queries to significantly reduce the entity label space as well.
Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods but also beats many state-of-the-art two-stage methods on the Visual Genome dataset.
arXiv Detail & Related papers (2023-06-09T06:02:01Z)
- Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module.
Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on three fine-grained object recognition benchmarks.
arXiv Detail & Related papers (2022-12-28T03:45:56Z)
- TransVG: End-to-End Visual Grounding with Transformers [102.11922622103613]
We present a transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to an image.
We show that the complex fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance.
arXiv Detail & Related papers (2021-04-17T13:35:24Z)
- A Co-Interactive Transformer for Joint Slot Filling and Intent Detection [61.109486326954205]
Intent detection and slot filling are two main tasks for building a spoken language understanding (SLU) system.
Previous studies either model the two tasks separately or only consider the single information flow from intent to slot.
We propose a Co-Interactive Transformer to consider the cross-impact between the two tasks simultaneously.
arXiv Detail & Related papers (2020-10-08T10:16:52Z)
- Attention-based Joint Detection of Object and Semantic Part [4.389917490809522]
Our model is built on top of two Faster-RCNN models that share their features to obtain enhanced representations for both tasks.
Experiments on the PASCAL-Part 2010 dataset show that joint detection can simultaneously improve both object detection and part detection.
arXiv Detail & Related papers (2020-07-05T18:54:10Z)