Fixed-size Objects Encoding for Visual Relationship Detection
- URL: http://arxiv.org/abs/2005.14600v1
- Date: Fri, 29 May 2020 14:36:25 GMT
- Title: Fixed-size Objects Encoding for Visual Relationship Detection
- Authors: Hengyue Pan, Xin Niu, Rongchun Li, Siqi Shen, Yong Dou
- Abstract summary: We propose a fixed-size object encoding method (FOE-VRD) to improve the performance of visual relationship detection tasks.
It uses one fixed-size vector to encode all objects in each input image to assist the process of relationship detection.
Experimental results on the VRD database show that the proposed method works well on both predicate classification and relationship detection.
- Score: 16.339394922532282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a fixed-size object encoding method (FOE-VRD) to
improve the performance of visual relationship detection tasks. Compared with
previous methods, FOE-VRD has one important feature: it uses a single
fixed-size vector to encode all objects in each input image to assist the
process of relationship detection. First, we use a regular convolutional neural
network as a feature extractor to generate high-level features of input images.
Then, for each relationship triplet in the input images, i.e.,
⟨subject-predicate-object⟩, we apply ROI pooling to get the feature vectors of
the two regions on the feature maps that correspond to the bounding boxes of the
subject and object. Besides the subject and object, our analysis implies that
the results of predicate classification may also be related to the remaining objects in
the input images (we call them background objects). Because the number of
background objects varies from image to image, and because of computational costs, we cannot
generate feature vectors for them one by one using the ROI pooling technique.
Instead, we propose a novel method to encode all background objects in each
image using one fixed-size vector (i.e., the FBE vector). By concatenating these three
vectors (subject, object, and FBE), we encode the objects using one
fixed-size vector. The generated feature vector is then fed into a fully
connected neural network to get predicate classification results. Experimental
results on the VRD database (entire set and zero-shot tests) show that the proposed
method works well on both predicate classification and relationship detection.
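The abstract outlines the pipeline without giving implementation details. That pipeline is: ROI features for the subject and object, one fixed-size vector for all background objects, concatenation, and then a fully connected classifier. In particular, the abstract does not say how the FBE vector is computed.

Below is a minimal pure-Python sketch of that data flow, built on several assumptions:

- `roi_max_pool` is ordinary ROI max pooling on a toy feature map whose boxes are already in feature-map coordinates.
- `background_encoding` is a hypothetical stand-in (a normalized category histogram), not the authors' FBE method.
- A single linear layer with softmax stands in for the fully connected network.

All function names are illustrative, and the weights are random rather than trained.

```python
import math
import random
from typing import List, Sequence, Tuple

# Assumed layout: fmap[channel][y][x]; boxes are already projected onto the
# feature-map grid as integer (x0, y0, x1, y1) with x1, y1 exclusive.
FeatureMap = List[List[List[float]]]
Box = Tuple[int, int, int, int]


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def roi_max_pool(fmap: FeatureMap, box: Box, out: int = 2) -> List[float]:
    """ROI max pooling: split `box` into an out x out grid, take the max of each
    bin in every channel, and flatten to a vector of length C * out * out."""
    x0, y0, x1, y1 = box
    h, w = y1 - y0, x1 - x0
    if h <= 0 or w <= 0:
        raise ValueError("box must have positive width and height")
    pooled: List[float] = []
    for channel in fmap:
        for i in range(out):
            # Floor start / ceil end, so every bin covers at least one cell.
            ys = range(y0 + (i * h) // out, y0 + ceil_div((i + 1) * h, out))
            for j in range(out):
                xs = range(x0 + (j * w) // out, x0 + ceil_div((j + 1) * w, out))
                pooled.append(max(channel[y][x] for y in ys for x in xs))
    return pooled


def background_encoding(labels: Sequence[int], num_classes: int) -> List[float]:
    """HYPOTHETICAL stand-in for the FBE vector (the abstract does not say how
    FBE is built): a normalized histogram of background-object categories.
    Its length is num_classes no matter how many background objects exist."""
    counts = [0.0] * num_classes
    for label in labels:
        counts[label] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def encode_pair(fmap: FeatureMap, boxes: Sequence[Box], labels: Sequence[int],
                subj: int, obj: int, num_classes: int, out: int = 2) -> List[float]:
    """Concatenate subject ROI features, object ROI features, and the
    background encoding of every other object in the image."""
    background = [labels[k] for k in range(len(boxes)) if k not in (subj, obj)]
    return (roi_max_pool(fmap, boxes[subj], out)
            + roi_max_pool(fmap, boxes[obj], out)
            + background_encoding(background, num_classes))


def fully_connected(x: Sequence[float], weights: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> List[float]:
    """One linear layer (a simplification): returns W x + b as predicate scores."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z: Sequence[float]) -> List[float]:
    """Numerically stable softmax over predicate scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


if __name__ == "__main__":
    rng = random.Random(0)
    C, H, W = 3, 8, 8
    fmap = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
    boxes = [(0, 0, 4, 4), (4, 4, 8, 8), (2, 5, 3, 7), (6, 0, 8, 2)]
    labels = [0, 1, 2, 2]  # boxes 2 and 3 play the role of background objects
    x = encode_pair(fmap, boxes, labels, subj=0, obj=1, num_classes=3)
    # 2 ROIs * 3 channels * 2 * 2 bins + 3 categories = 27 values.
    num_predicates = 4
    weights = [[rng.uniform(-0.1, 0.1) for _ in x] for _ in range(num_predicates)]
    probs = softmax(fully_connected(x, weights, [0.0] * num_predicates))
    print(len(x), [round(p, 3) for p in probs])
```

The sketch illustrates the property the abstract relies on. The length of the concatenated vector depends only on the channel count, the pooling grid, and the number of categories. Adding or removing background objects therefore does not change the classifier's input size.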
Related papers
- GOReloc: Graph-based Object-Level Relocalization for Visual SLAM [17.608119427712236] (2024-08-15T03:54:33Z)
This article introduces a novel method for object-level relocalization of robotic systems.
It determines the pose of a camera sensor by robustly associating the object detections in the current frame with 3D objects in a lightweight object-level map.
- Ablation Study to Clarify the Mechanism of Object Segmentation in Multi-Object Representation Learning [3.921076451326107] (2023-10-05T02:59:48Z)
Multi-object representation learning aims to represent complex real-world visual input using the composition of multiple objects.
It is not clear how previous methods have achieved the appropriate segmentation of individual objects.
Most of the previous methods regularize the latent vectors using a Variational Autoencoder (VAE).
- Adaptive Rotated Convolution for Rotated Object Detection [96.94590550217718] (2023-03-14T11:53:12Z)
We present an Adaptive Rotated Convolution (ARC) module to handle the rotated object detection problem.
In our ARC module, the convolution kernels rotate adaptively to extract object features with varying orientations in different images.
The proposed approach achieves state-of-the-art performance on the DOTA dataset with 81.77% mAP.
- Scrape, Cut, Paste and Learn: Automated Dataset Generation Applied to Parcel Logistics [58.720142291102135] (2022-10-18T12:49:04Z)
We present a fully automated pipeline to generate a synthetic dataset for instance segmentation in four steps.
We first scrape images for the objects of interest from popular image search engines.
We compare three different methods for image selection: object-agnostic pre-processing, manual image selection, and CNN-based image selection.
- CASAPose: Class-Adaptive and Semantic-Aware Multi-Object Pose Estimation [2.861848675707602] (2022-10-11T10:20:01Z)
We present a new single-stage architecture called CASAPose.
It determines 2D-3D correspondences for pose estimation of multiple different objects in RGB images in one pass.
It is fast and memory efficient, and achieves high accuracy for multiple objects.
- Disentangled Representation Learning Using (β-)VAE and GAN [0.0] (2022-08-09T05:37:06Z)
The dSprite dataset provided the desired features for the required experiments.
After training the VAE combined with a Generative Adversarial Network (GAN), each dimension of the hidden vector was disrupted to explore the disentanglement in each dimension.
- Relation Regularized Scene Graph Generation [206.76762860019065] (2022-02-22T11:36:49Z)
Scene graph generation (SGG) is built on top of detected objects to predict pairwise visual relations between objects.
We propose a relation regularized network (R2-Net) that can predict whether there is a relationship between two objects.
Our R2-Net can effectively refine object labels and generate scene graphs.
- Aligning Pretraining for Detection via Object-Level Contrastive Learning [57.845286545603415] (2021-06-04T17:59:52Z)
Image-level contrastive representation learning has proven to be highly effective as a generic model for transfer learning.
We argue that this could be sub-optimal and thus advocate a design principle that encourages alignment between the self-supervised pretext task and the downstream task.
Our method, called Selective Object COntrastive learning (SoCo), achieves state-of-the-art results for transfer performance on COCO detection.
- Deep ensembles based on Stochastic Activation Selection for Polyp Segmentation [82.61182037130406] (2021-04-02T02:07:37Z)
This work deals with medical image segmentation, in particular with accurate polyp detection and segmentation during colonoscopy examinations.
The basic architecture in image segmentation consists of an encoder and a decoder.
We compare some variants of the DeepLab architecture obtained by varying the decoder backbone.
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289] (2020-02-20T00:51:01Z)
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN), which processes images and sentences symmetrically using recurrent neural networks (RNNs).
Our model achieves state-of-the-art performance on the Flickr30K dataset and competitive performance on the MS-COCO dataset.
This list is automatically generated from the titles and abstracts of the papers on this site.