Visual Relationship Detection with Visual-Linguistic Knowledge from
Multimodal Representations
- URL: http://arxiv.org/abs/2009.04965v3
- Date: Mon, 5 Apr 2021 07:48:10 GMT
- Title: Visual Relationship Detection with Visual-Linguistic Knowledge from
Multimodal Representations
- Authors: Meng-Jiun Chiou, Roger Zimmermann, Jiashi Feng
- Abstract summary: Visual relationship detection aims to reason over relationships among salient objects in images.
We propose a novel approach named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT).
RVL-BERT performs relational reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training with multimodal representations.
- Score: 103.00383924074585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual relationship detection aims to reason over relationships among salient
objects in images, which has drawn increasing attention over the past few
years. Inspired by human reasoning mechanisms, it is believed that external
visual commonsense knowledge is beneficial for reasoning visual relationships
of objects in images, which is however rarely considered in existing methods.
In this paper, we propose a novel approach named Relational Visual-Linguistic
Bidirectional Encoder Representations from Transformers (RVL-BERT), which
performs relational reasoning with both visual and language commonsense
knowledge learned via self-supervised pre-training with multimodal
representations. RVL-BERT also uses an effective spatial module and a novel
mask attention module to explicitly capture spatial information among the
objects. Moreover, our model decouples object detection from visual
relationship recognition by taking in object names directly, enabling it to be
used on top of any object detection system. We show through quantitative and
qualitative experiments that, with the transferred knowledge and novel modules,
RVL-BERT achieves competitive results on two challenging visual relationship
detection datasets. The source code is available at
https://github.com/coldmanck/RVL-BERT.
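The decoupling described in the abstract, a relationship recognizer that consumes the object names and boxes produced by any object detection system together with an explicit spatial cue, can be illustrated with a minimal sketch. This is not the RVL-BERT implementation (which builds on a BERT-style multimodal backbone with a mask attention module); the class names, dimensions, and the simple MLP spatial encoding below are illustrative assumptions, and PyTorch is assumed to be available.

```python
# Illustrative sketch (not the authors' code, PyTorch assumed): a predicate
# classifier decoupled from object detection. It takes the object *names* and
# boxes emitted by any detector, plus a simple spatial encoding of the box pair.
import torch
import torch.nn as nn


class SpatialModule(nn.Module):
    """Encodes the normalized boxes of a (subject, object) pair with a small MLP."""

    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(8, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, subj_box, obj_box):
        # Boxes are (x1, y1, x2, y2), normalized to [0, 1].
        return self.mlp(torch.cat([subj_box, obj_box], dim=-1))


class RelationshipClassifier(nn.Module):
    """Predicts predicate logits from (subject name, object name, spatial cue)."""

    def __init__(self, vocab, num_predicates, dim=256):
        super().__init__()
        self.word_idx = {w: i for i, w in enumerate(vocab)}
        self.word_emb = nn.Embedding(len(vocab), dim)
        self.spatial = SpatialModule(dim)
        self.head = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, num_predicates)
        )

    def forward(self, subj_name, obj_name, subj_box, obj_box):
        s = self.word_emb(torch.tensor([self.word_idx[subj_name]]))
        o = self.word_emb(torch.tensor([self.word_idx[obj_name]]))
        sp = self.spatial(subj_box, obj_box).unsqueeze(0)
        return self.head(torch.cat([s, o, sp], dim=-1))  # (1, num_predicates)


# Usage with detections from any off-the-shelf detector.
model = RelationshipClassifier(vocab=["person", "bicycle"], num_predicates=5)
logits = model(
    "person", "bicycle",
    torch.tensor([0.10, 0.20, 0.50, 0.90]), torch.tensor([0.30, 0.55, 0.80, 0.95]),
)
print(logits.shape)  # torch.Size([1, 5])
```

Because the classifier consumes only detector outputs (names and boxes) rather than detector-internal features, the upstream detector can be swapped without changing the relationship model, which is the practical benefit the abstract claims for the decoupling.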
Related papers
- End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting [68.37943632270505]
Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories.
Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories.
We propose an open-vocabulary relationship detection method that leverages the rich semantic knowledge of CLIP to discover novel relationships.
arXiv Detail & Related papers (2024-09-19T06:25:01Z)
- Augmented Commonsense Knowledge for Remote Object Grounding [67.30864498454805]
We propose an augmented commonsense knowledge model (ACK) to leverage commonsense information as a temporal knowledge graph for improving agent navigation.
ACK consists of knowledge graph-aware cross-modal and concept aggregation modules to enhance visual representation and visual-textual data alignment.
We add a new pipeline for the commonsense-based decision-making process which leads to more accurate local action prediction.
arXiv Detail & Related papers (2024-06-03T12:12:33Z)
- Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection [14.22646492640906]
We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection.
Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly.
Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds.
arXiv Detail & Related papers (2024-03-21T10:15:57Z)
- Video Relationship Detection Using Mixture of Experts [1.6574413179773761]
We introduce MoE-VRD, a novel approach to visual relationship detection utilizing a mixture of experts.
MoE-VRD identifies language triplets in the form of <subject, predicate, object> tuples to extract relationships from visual processing (a minimal mixture-of-experts sketch, not from the paper, appears after this list).
Our experimental results demonstrate that the conditional computation capabilities and scalability of the mixture-of-experts approach lead to superior performance in visual relationship detection compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-03-06T19:08:34Z)
- Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction [6.798515070856465]
We present a novel self-supervised approach for representation learning, particularly for the task of Visual Relationship Detection (VRD).
Motivated by the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding Box Reconstruction (MBBR).
arXiv Detail & Related papers (2023-11-08T16:59:26Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Knowledge-augmented Few-shot Visual Relation Detection [25.457693302327637]
Visual Relation Detection (VRD) aims to detect relationships between objects for image understanding.
Most existing VRD methods rely on thousands of training samples of each relationship to achieve satisfactory performance.
We devise a knowledge-augmented, few-shot VRD framework leveraging both textual knowledge and visual relation knowledge.
arXiv Detail & Related papers (2023-03-09T15:38:40Z)
- Exploiting Multi-Object Relationships for Detecting Adversarial Attacks in Complex Scenes [51.65308857232767]
Vision systems that deploy Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples.
Recent research has shown that checking the intrinsic consistencies in the input data is a promising way to detect adversarial attacks.
We develop a novel approach to perform context consistency checks using language models.
arXiv Detail & Related papers (2021-08-19T00:52:10Z)
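As referenced in the MoE-VRD entry above, the following is a minimal, hypothetical sketch of mixture-of-experts gating for predicate classification over joint subject-object pair features. It is not taken from that paper: the feature dimension, expert count, and top-k routing are assumptions, and for clarity every expert is evaluated before mixing, whereas genuine conditional computation would evaluate only the selected experts.

```python
# Hypothetical sketch (PyTorch assumed) of mixture-of-experts gating for predicate
# classification over a joint subject-object pair feature; not code from MoE-VRD.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEPredicateClassifier(nn.Module):
    def __init__(self, in_dim=512, num_predicates=50, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, num_predicates))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(in_dim, num_experts)  # routing network
        self.top_k = top_k

    def forward(self, pair_feat):
        # pair_feat: (batch, in_dim) joint feature of a candidate (subject, object) pair.
        gate_logits = self.gate(pair_feat)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # mixing weights over the selected experts
        # For clarity, all experts are evaluated here; true conditional computation
        # would run only the top-k experts chosen by the gate.
        expert_out = torch.stack([e(pair_feat) for e in self.experts], dim=1)  # (B, E, P)
        batch = torch.arange(pair_feat.size(0))
        out = sum(
            weights[:, k:k + 1] * expert_out[batch, idx[:, k]] for k in range(self.top_k)
        )
        return out  # predicate logits, shape (batch, num_predicates)


# Toy usage
model = MoEPredicateClassifier()
print(model(torch.randn(3, 512)).shape)  # torch.Size([3, 50])
```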