AVR: Attention based Salient Visual Relationship Detection
- URL: http://arxiv.org/abs/2003.07012v1
- Date: Mon, 16 Mar 2020 04:12:39 GMT
- Title: AVR: Attention based Salient Visual Relationship Detection
- Authors: Jianming Lv, Qinzhe Xiao, Jiajie Zhong
- Abstract summary: Visual relationship detection aims to locate objects in images and recognize the relationships between objects.
Traditional methods treat all observed relationships in an image equally, which leads to relatively poor performance on complex images with abundant visual objects and various relationships.
We propose an attention-based model, namely AVR, to detect salient visual relationships based on both the local and global context of the relationships.
- Score: 5.844015313757266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual relationship detection aims to locate objects in images and recognize
the relationships between objects. Traditional methods treat all observed
relationships in an image equally, which leads to relatively poor performance
on complex images with abundant visual objects and various relationships. To
address this problem, we propose an attention-based model, namely AVR, to
detect salient visual relationships based on both the local and global context
of the relationships. Specifically, AVR recognizes relationships and measures
the attention on them in the local context of an input image by fusing the
visual features, semantic information, and spatial information of the
relationships. AVR then applies this attention to assign larger salience
weights to important relationships for effective information filtering.
Furthermore, AVR incorporates prior knowledge from the global context of image
datasets to improve the precision of relationship prediction, where the
context is modeled as a heterogeneous graph and the prior probability of
relationships is measured with a random-walk algorithm. Comprehensive
experiments on several real-world image datasets demonstrate the effectiveness
of AVR, and the results show that it outperforms state-of-the-art visual
relationship detection methods significantly, by up to $87.5\%$ in terms of
recall.
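The local attention mechanism described in the abstract lends itself to a compact illustration. Below is a minimal sketch, not the authors' implementation: visual, semantic, and spatial features of each candidate subject-object pair are fused, a per-pair salience weight is computed over all pairs in the image, and predicate scores are rescaled by that weight. All layer names and feature dimensions are illustrative assumptions.

```python
# Illustrative sketch of attention-weighted relationship scoring; the layer
# sizes and structure are assumptions, not AVR's published architecture.
import torch
import torch.nn as nn

class SalientRelationHead(nn.Module):
    def __init__(self, d_vis=1024, d_sem=300, d_spa=8, d_hid=512, n_predicates=50):
        super().__init__()
        # Fuse the visual, semantic, and spatial features of one subject-object pair.
        self.fuse = nn.Sequential(nn.Linear(d_vis + d_sem + d_spa, d_hid), nn.ReLU())
        self.predicate = nn.Linear(d_hid, n_predicates)  # relationship classifier
        self.attn = nn.Linear(d_hid, 1)                  # salience of each pair

    def forward(self, vis, sem, spa):
        # vis/sem/spa: (num_pairs, d_*) features for all candidate pairs in one image
        h = self.fuse(torch.cat([vis, sem, spa], dim=-1))
        scores = self.predicate(h)
        # Softmax over this image's pairs: salient relationships get larger weights.
        salience = torch.softmax(self.attn(h).squeeze(-1), dim=0)
        return scores * salience.unsqueeze(-1), salience

head = SalientRelationHead()
vis, sem, spa = torch.randn(12, 1024), torch.randn(12, 300), torch.randn(12, 8)
weighted_scores, salience = head(vis, sem, spa)
print(weighted_scores.shape)  # torch.Size([12, 50]); salience sums to 1
```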
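The global-context prior can be sketched in the same spirit. The toy example below models object categories and predicates as nodes of a heterogeneous graph with edges weighted by training-triplet co-occurrence, then runs a random walk with restart from a subject-object pair to estimate a prior over predicates. The graph construction and the restart scheme are assumptions, since the abstract does not spell them out.

```python
# Toy random-walk prior over a heterogeneous graph of object categories and
# predicates; the construction below is an assumption, not AVR's exact model.
import numpy as np

triplets = [("person", "ride", "horse"), ("person", "ride", "bike"),
            ("person", "feed", "horse"), ("dog", "chase", "cat")]
objects = sorted({t[0] for t in triplets} | {t[2] for t in triplets})
predicates = sorted({t[1] for t in triplets})
nodes = objects + predicates
idx = {n: i for i, n in enumerate(nodes)}

# Each triplet links subject<->predicate and predicate<->object.
A = np.zeros((len(nodes), len(nodes)))
for s, p, o in triplets:
    for a, b in [(s, p), (p, o)]:
        A[idx[a], idx[b]] += 1.0
        A[idx[b], idx[a]] += 1.0
P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix

def predicate_prior(subj, obj, restart=0.3, steps=50):
    # Random walk with restart, seeded at the subject and object nodes.
    r = np.zeros(len(nodes))
    r[idx[subj]] = r[idx[obj]] = 0.5
    pi = r.copy()
    for _ in range(steps):
        pi = (1 - restart) * P.T @ pi + restart * r
    prior = np.array([pi[idx[p]] for p in predicates])
    return prior / prior.sum()

print(dict(zip(predicates, predicate_prior("person", "horse").round(3))))
```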
Related papers
- EGTR: Extracting Graph from Transformer for Scene Graph Generation [5.935927309154952]
Scene Graph Generation (SGG) is the challenging task of detecting objects and predicting the relationships between them.
We propose a lightweight one-stage SGG model that extracts the relation graph from the various relationships learned in the multi-head self-attention layers of the DETR decoder.
We demonstrate the effectiveness and efficiency of our method on the Visual Genome and Open Images V6 datasets.
arXiv Detail & Related papers (2024-04-02T16:20:02Z)
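As a rough illustration of the EGTR idea above, the sketch below reads predicate scores directly off the self-attention maps of DETR-style object queries: for every ordered query pair, the attention values across layers and heads form a small feature that a light head classifies. The shapes and the tiny classifier are assumptions, not EGTR's actual architecture.

```python
# Hedged sketch: treat pairwise query self-attention as relation features.
import torch
import torch.nn as nn

L, H, Q, n_pred = 6, 8, 100, 50   # decoder layers, heads, object queries, predicates
# Stacked self-attention maps from the decoder: entry [l, h, i, j] is how
# much query i attends to query j in layer l, head h (random stand-in here).
attn_maps = torch.rand(L, H, Q, Q)

# Per ordered query pair (i, j), gather its attention values across all
# layers and heads as an L*H-dim feature, then classify the predicate.
pair_feats = attn_maps.permute(2, 3, 0, 1).reshape(Q, Q, L * H)
relation_head = nn.Sequential(nn.Linear(L * H, 128), nn.ReLU(),
                              nn.Linear(128, n_pred))
relation_logits = relation_head(pair_feats)   # (Q, Q, n_pred) relation graph
print(relation_logits.shape)
```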
- Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection [14.22646492640906]
We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection.
Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly.
Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds.
arXiv Detail & Related papers (2024-03-21T10:15:57Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Tensor Composition Net for Visual Relationship Prediction [115.14829858763399]
We present a novel Tensor Composition Net (TCN) to predict visual relationships in images.
The key idea of our TCN is to exploit the low-rank property of the visual relationship tensor.
We show our TCN's image-level visual relationship prediction provides a simple and efficient mechanism for relation-based image retrieval.
arXiv Detail & Related papers (2020-12-10T06:27:20Z)
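The low-rank idea named in the TCN summary above can be made concrete with a CP-style factorization: the full (subject, predicate, object) score tensor is never materialized; instead, three small factor tables score any triplet. This is an illustrative reconstruction, not TCN's exact parameterization.

```python
# CP-style low-rank scoring of (subject, predicate, object) triplets;
# sizes and the factorization choice are assumptions for illustration.
import torch
import torch.nn as nn

n_obj_cls, n_pred, R = 150, 50, 32
subj_emb = nn.Embedding(n_obj_cls, R)   # U: subject factors
pred_emb = nn.Embedding(n_pred, R)      # W: predicate factors
obj_emb = nn.Embedding(n_obj_cls, R)    # V: object factors

def triplet_scores(subj_ids, obj_ids):
    # score(s, p, o) = sum_r U[s,r] * W[p,r] * V[o,r], for every predicate p
    u = subj_emb(subj_ids)               # (B, R)
    v = obj_emb(obj_ids)                 # (B, R)
    return (u * v) @ pred_emb.weight.T   # (B, n_pred)

scores = triplet_scores(torch.tensor([3, 7]), torch.tensor([10, 42]))
print(scores.shape)  # torch.Size([2, 50])
```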
- Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations [103.00383924074585]
Visual relationship detection aims to reason over relationships among salient objects in images.
We propose a novel approach named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT).
RVL-BERT performs spatial reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training.
arXiv Detail & Related papers (2020-09-10T16:15:09Z)
- DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
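A hedged sketch of the dual relation graph above: each human-object pair carries a spatial-semantic feature, and context is aggregated between pairs that share the same human (human-centric graph) or the same object (object-centric graph). The mean aggregation and residual update are assumptions, not DRG's exact formulation.

```python
# Toy dual-graph context aggregation over human-object pair features.
import torch

n_pairs, d = 6, 128
feats = torch.randn(n_pairs, d)              # spatial-semantic feature per pair
humans = torch.tensor([0, 0, 1, 1, 2, 2])    # human index of each pair
objs = torch.tensor([3, 4, 3, 5, 4, 5])      # object index of each pair

def aggregate(feats, groups):
    # Average the features of all pairs sharing the same human (or object).
    out = torch.zeros_like(feats)
    for g in groups.unique():
        mask = groups == g
        out[mask] = feats[mask].mean(dim=0)
    return out

# Dual relation graph: combine human-centric and object-centric context.
context = 0.5 * (aggregate(feats, humans) + aggregate(feats, objs))
refined = feats + context   # residual update before interaction classification
print(refined.shape)        # torch.Size([6, 128])
```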
- ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> triplets in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
arXiv Detail & Related papers (2020-08-14T09:11:18Z)
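The matching scheme in the ConsNet summary above lends itself to a short sketch: pair features and HOI label embeddings are projected into one joint space and scored by cosine similarity, which is what allows unseen labels to be detected. The dimensions and projections below are illustrative assumptions.

```python
# Hedged sketch of visual-semantic joint-embedding matching for HOI labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_vis, d_word, d_joint, n_hoi = 2048, 300, 512, 600
vis_proj = nn.Linear(d_vis, d_joint)    # maps pair appearance features
txt_proj = nn.Linear(d_word, d_joint)   # maps HOI label word embeddings

pair_feats = torch.randn(4, d_vis)       # 4 candidate human-object pairs
label_embs = torch.randn(n_hoi, d_word)  # e.g. averaged word vectors per label

v = F.normalize(vis_proj(pair_feats), dim=-1)
t = F.normalize(txt_proj(label_embs), dim=-1)
scores = v @ t.T   # (4, n_hoi) cosine similarities; works for unseen labels too
print(scores.shape)
```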
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
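A minimal sketch of the coarse-to-fine cascade described above: each stage refines the proposal features (standing in for instance localization) and hands them to an interaction classifier, so later stages operate on better representations. The stand-in modules and sizes are assumptions, not the paper's networks.

```python
# Toy multi-stage cascade: refine proposal features, then classify interactions.
import torch
import torch.nn as nn

class Stage(nn.Module):
    def __init__(self, d=256, n_actions=117):
        super().__init__()
        self.refine = nn.Linear(d, d)             # stand-in for instance localization
        self.classify = nn.Linear(d, n_actions)   # interaction recognition head

    def forward(self, feats):
        feats = torch.relu(self.refine(feats)) + feats  # residual refinement
        return feats, self.classify(feats)

stages = nn.ModuleList([Stage() for _ in range(3)])
feats = torch.randn(8, 256)   # 8 human-object pair proposals
for stage in stages:
    feats, action_logits = stage(feats)  # coarse-to-fine: reuse refined features
print(action_logits.shape)    # torch.Size([8, 117])
```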