RelTransformer: Balancing the Visual Relationship Detection from Local
Context, Scene and Memory
- URL: http://arxiv.org/abs/2104.11934v1
- Date: Sat, 24 Apr 2021 12:04:04 GMT
- Title: RelTransformer: Balancing the Visual Relationship Detection from Local
Context, Scene and Memory
- Authors: Jun Chen, Aniket Agarwal, Sherif Abdelkarim, Deyao Zhu, Mohamed
Elhoseiny
- Abstract summary: We propose a novel framework, dubbed RelTransformer, which performs relationship prediction using rich semantic features from multiple image levels.
Our model significantly improves the accuracy on GQA-LT by 27.4% over the best baselines on tail-relationship prediction.
- Score: 24.085223165006212
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visual relationship recognition (VRR) is a fundamental scene understanding
task. The structure that VRR provides is essential for improving AI
interpretability in downstream tasks such as image captioning and visual
question answering. Several recent studies have shown that the long-tail problem
in VRR is even more critical than that in object recognition due to its
compositional complexity and structure. To overcome this limitation, we propose
a novel transformer-based framework, dubbed RelTransformer, which performs
relationship prediction using rich semantic features from multiple image
levels. We assume that more abundant contextual features can generate more
accurate and discriminative relationships, which can be useful when sufficient
training data are lacking. The key feature of our model is its ability to
aggregate three different-level features (local context, scene, and
dataset-level) to compositionally predict the visual relationship. We evaluate
our model on Visual Genome and two "long-tail" VRR datasets, GQA-LT and
VG8k-LT. Extensive experiments demonstrate that our RelTransformer improves
over the state-of-the-art baselines on all datasets. In addition, our model
significantly improves the accuracy on GQA-LT by 27.4% over the best baselines
on tail-relationship prediction. Our code is available at
https://github.com/Vision-CAIR/RelTransformer.
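To make the three-level aggregation concrete, here is a minimal, hedged sketch of the idea, not the authors' implementation (which lives in the repository above): it assumes a fused subject-object "local context" vector, per-image scene features, and a learnable dataset-level memory, combines them with cross-attention, and classifies the relation. All module names, tensor sizes, and the simple additive fusion are illustrative assumptions.

```python
# Conceptual sketch only: fusing local-context, scene, and dataset-level features
# for relation classification. Names, dimensions, and the fusion scheme are assumed;
# see https://github.com/Vision-CAIR/RelTransformer for the actual model.
import torch
import torch.nn as nn


class ThreeLevelRelationHead(nn.Module):
    def __init__(self, dim=512, num_relations=100, memory_slots=64, heads=8):
        super().__init__()
        # Learnable dataset-level "memory" (assumption: stands in for global, cross-image cues).
        self.memory = nn.Parameter(torch.randn(memory_slots, dim) * 0.02)
        # Cross-attention from the subject-object query into scene features and into memory.
        self.scene_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.memory_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_relations)
        )

    def forward(self, pair_feat, scene_feats):
        # pair_feat: (B, dim) local context of one subject-object pair
        # scene_feats: (B, N, dim) features of all regions/objects in the image
        q = pair_feat.unsqueeze(1)                                   # (B, 1, dim)
        scene_ctx, _ = self.scene_attn(q, scene_feats, scene_feats)  # attend over the scene
        mem = self.memory.unsqueeze(0).expand(q.size(0), -1, -1)     # (B, M, dim)
        mem_ctx, _ = self.memory_attn(q, mem, mem)                   # attend over global memory
        fused = q + scene_ctx + mem_ctx                              # simple additive fusion (assumed)
        return self.classifier(fused.squeeze(1))                     # (B, num_relations) relation logits


# Usage with random tensors: 4 subject-object pairs, 36 scene regions per image.
logits = ThreeLevelRelationHead()(torch.randn(4, 512), torch.randn(4, 36, 512))
print(logits.shape)  # torch.Size([4, 100])
```

The learnable memory here is only a stand-in for the dataset-level signal the abstract describes; the paper's actual memory mechanism and fusion order may differ.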
Related papers
- A Pure Transformer Pretraining Framework on Text-attributed Graphs [50.833130854272774]
We introduce a feature-centric pretraining perspective by treating graph structure as a prior.
Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks.
GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
arXiv Detail & Related papers (2024-06-19T22:30:08Z) - Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions [6.231370972617915]
Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts.
Existing vision-language alignment models, e.g., CLIP, struggle with both aspects and therefore cannot be directly used for this task.
We leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object).
arXiv Detail & Related papers (2023-11-28T18:55:37Z) - RLIPv2: Fast Scaling of Relational Language-Image Pre-training [53.21796397618875]
We propose RLIPv2, a fast converging model that enables the relational scaling of pre-training to large-scale pseudo-labelled scene graph data.
Asymmetric Language-Image Fusion (ALIF) facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding.
RLIPv2 shows state-of-the-art performance on three benchmarks under fully-finetuning, few-shot and zero-shot settings.
arXiv Detail & Related papers (2023-08-18T07:17:09Z) - When and why vision-language models behave like bags-of-words, and what
to do about it? [39.90099818890488]
We create the Attribution, Relation, and Order benchmark to evaluate the ability of VLMs to understand different types of relationships, attributes, and order.
ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases.
We show where state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity.
arXiv Detail & Related papers (2022-10-04T22:13:25Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Self-Supervised Pre-Training for Transformer-Based Person
Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z) - Relation Transformer Network [25.141472361426818]
We propose a novel transformer formulation for scene graph generation and relation prediction.
We leverage the encoder-decoder architecture of the transformer for rich feature embedding of nodes and edges.
Our relation prediction module classifies the directed relation from the learned node and edge embeddings (a minimal illustrative sketch follows this list).
arXiv Detail & Related papers (2020-04-13T20:47:01Z)
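The last entry above describes classifying a directed relation from learned node and edge embeddings. The sketch below illustrates that idea under simplifying assumptions: a generic transformer encoder contextualizes node features, and a direction-aware pair head predicts the relation. Module names, sizes, and the omission of explicit edge tokens are assumptions, not the paper's architecture.

```python
# Minimal, assumed sketch of directed-relation classification from node embeddings.
import torch
import torch.nn as nn


class DirectedRelationClassifier(nn.Module):
    def __init__(self, dim=256, num_relations=50, heads=4, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.node_encoder = nn.TransformerEncoder(enc_layer, layers)
        # Concatenation order (subject first, object second) encodes the direction.
        self.head = nn.Linear(2 * dim, num_relations)

    def forward(self, node_feats, subj_idx, obj_idx):
        # node_feats: (B, N, dim) initial node features; subj_idx/obj_idx: (B,) node indices
        nodes = self.node_encoder(node_feats)          # contextualized node embeddings
        batch = torch.arange(nodes.size(0))
        pair = torch.cat([nodes[batch, subj_idx], nodes[batch, obj_idx]], dim=-1)
        return self.head(pair)                         # (B, num_relations) relation logits


# Usage: 2 images with 10 nodes each; predict the relation for one directed pair per image.
feats = torch.randn(2, 10, 256)
print(DirectedRelationClassifier()(feats, torch.tensor([0, 3]), torch.tensor([5, 1])).shape)
```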