Estimating Extreme 3D Image Rotation with Transformer Cross-Attention
- URL: http://arxiv.org/abs/2303.02615v2
- Date: Fri, 8 Mar 2024 19:29:10 GMT
- Title: Estimating Extreme 3D Image Rotation with Transformer Cross-Attention
- Authors: Shay Dekel, Yosi Keller, Martin Cadik
- Abstract summary: We propose a cross-attention-based approach that utilizes CNN feature maps and a Transformer-Encoder to compute the cross-attention between the activation maps of the image pairs.
It is experimentally shown to outperform contemporary state-of-the-art schemes when applied to commonly used image rotation datasets and benchmarks.
- Score: 13.82735766201496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The estimation of large and extreme image rotation plays a key role
in multiple computer vision domains where the rotated images are related by a
limited or non-overlapping field of view. Contemporary approaches apply
convolutional neural networks to compute a 4D correlation volume that is used
to estimate the relative rotation between image pairs. In this work, we propose
a cross-attention-based approach that uses CNN feature maps and a
Transformer-Encoder to compute the cross-attention between the activation maps
of the image pairs, which is shown to be an improved equivalent of the 4D
correlation volume used in previous works. In the suggested approach, higher
attention scores are associated with image regions that encode visual cues of
rotation. Our approach is end-to-end trainable and optimizes a simple
regression loss. It is experimentally shown to outperform contemporary
state-of-the-art schemes on commonly used image rotation datasets and
benchmarks, establishing a new state-of-the-art accuracy on these datasets. We
make our code publicly available.
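A minimal sketch of the described pipeline, assuming a small convolutional backbone, a single nn.MultiheadAttention layer for the cross-attention, and a quaternion regression head; the layer sizes, pooling, and rotation parametrization are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch of cross-attention rotation regression (not the authors' exact model).
import torch
import torch.nn as nn

class CrossAttnRotationNet(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Assumed CNN backbone: any extractor producing (B, dim, H, W) feature maps.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Cross-attention: tokens of image A attend to tokens of image B.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 4)  # assumed quaternion output

    def tokens(self, img):
        f = self.backbone(img)                   # (B, C, H, W)
        return f.flatten(2).transpose(1, 2)      # (B, H*W, C) token sequence

    def forward(self, img_a, img_b):
        ta, tb = self.tokens(img_a), self.tokens(img_b)
        # All-pairs attention scores play the role of a 4D correlation volume.
        attended, _ = self.cross_attn(query=ta, key=tb, value=tb)
        q = self.head(attended.mean(dim=1))      # pool tokens, regress rotation
        return q / q.norm(dim=-1, keepdim=True)  # unit quaternion

# Usage: a simple regression loss between predicted and ground-truth quaternions.
model = CrossAttnRotationNet()
a, b = torch.randn(2, 3, 128, 128), torch.randn(2, 3, 128, 128)
q_gt = torch.randn(2, 4); q_gt = q_gt / q_gt.norm(dim=-1, keepdim=True)
loss = (model(a, b) - q_gt).pow(2).mean()
```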
Related papers
- 3D Equivariant Pose Regression via Direct Wigner-D Harmonics Prediction [50.07071392673984]
Existing methods learn 3D rotations parametrized in the spatial domain using angles or quaternions.
We propose a frequency-domain approach that directly predicts Wigner-D coefficients for 3D rotation regression.
Our method achieves state-of-the-art results on benchmarks such as ModelNet10-SO(3) and PASCAL3D+.
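As an illustrative formulation (assumed, not quoted from that paper): the network predicts the Wigner-D matrices, the irreducible representations of SO(3), up to some band limit L, and a regression loss compares them with those of the ground-truth rotation.

```latex
% Assumed illustration of frequency-domain rotation regression:
% \hat{D}^{\ell}(x) are the predicted Wigner-D coefficients for input x,
% D^{\ell}(R^{*}) those of the ground-truth rotation R^{*}.
\mathcal{L}(x) = \sum_{\ell=0}^{L}
  \left\| \hat{D}^{\ell}(x) - D^{\ell}\!\left(R^{*}\right) \right\|_{F}^{2}
```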
arXiv Detail & Related papers (2024-11-01T12:50:38Z)
- Distributed Stochastic Optimization of a Neural Representation Network for Time-Space Tomography Reconstruction
4D time-space reconstruction of dynamic events or deforming objects using X-ray computed tomography (CT) is an extremely ill-posed inverse problem.
Existing approaches assume that the object remains static for the duration of several tens or hundreds of X-ray projection measurement images.
We propose to perform a 4D time-space reconstruction using a distributed implicit neural representation network that is trained using a novel distributed training algorithm.
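A minimal sketch of the core ingredient, a coordinate MLP that maps a space-time point to an attenuation value; the Fourier-feature encoding and layer sizes are assumptions, and the paper's distributed training algorithm is not reproduced here.

```python
# Hypothetical coordinate-MLP sketch for 4D (x, y, z, t) reconstruction.
import torch
import torch.nn as nn

class SpaceTimeINR(nn.Module):
    def __init__(self, hidden=256, n_freqs=8):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 4 * 2 * n_freqs  # sin/cos Fourier features of (x, y, z, t)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # attenuation at the queried point
        )

    def forward(self, coords):  # coords: (N, 4) in [-1, 1]
        freqs = 2.0 ** torch.arange(self.n_freqs, device=coords.device,
                                    dtype=coords.dtype)
        ang = coords[..., None] * freqs               # (N, 4, n_freqs)
        enc = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)
        return self.mlp(enc)

# In practice the network would be supervised by comparing simulated X-ray
# projections of its output against measurements (CT forward model omitted).
pts = torch.rand(1024, 4) * 2 - 1
density = SpaceTimeINR()(pts)                         # (1024, 1)
```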
arXiv Detail & Related papers (2024-04-29T19:41:51Z)
- Cross-domain and Cross-dimension Learning for Image-to-Graph Transformers [50.576354045312115]
Direct image-to-graph transformation is a challenging task that solves object detection and relationship prediction in a single model.
We introduce a set of methods enabling cross-domain and cross-dimension transfer learning for image-to-graph transformers.
We demonstrate our method's utility in cross-domain and cross-dimension experiments, where we pretrain our models on 2D satellite images before applying them to vastly different target domains in 2D and 3D.
arXiv Detail & Related papers (2024-03-11T10:48:56Z)
- Plug-and-Play Regularization on Magnitude with Deep Priors for 3D Near-Field MIMO Imaging [0.0]
Near-field radar imaging systems are used in a wide range of applications such as concealed weapon detection and medical diagnosis.
We consider the problem of reconstructing the three-dimensional (3D) complex-valued reflectivity by enforcing regularization on its magnitude.
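A hedged sketch of the plug-and-play idea as described: within an iterative solver, a learned denoiser acts on the magnitude of the complex reflectivity while the phase is carried through unchanged. The operators, denoiser, and step size below are placeholders, not the paper's implementation.

```python
# Hypothetical plug-and-play iteration regularizing only the magnitude
# of a complex-valued reflectivity x (placeholder operators throughout).
import torch

def pnp_magnitude_step(x, y, forward_op, adjoint_op, denoiser, step=0.5):
    """One gradient step on the data term, then a prior step on |x|."""
    # Data fidelity: gradient of 0.5 * ||A x - y||^2 for a linear A.
    grad = adjoint_op(forward_op(x) - y)
    x = x - step * grad
    # Prior: denoise the magnitude, keep the phase.
    mag, phase = x.abs(), torch.angle(x)
    mag = denoiser(mag)                      # plugged-in deep prior on |x|
    return mag * torch.exp(1j * phase)

# Example with trivial placeholder operators:
A = lambda x: x; At = lambda r: r; D = lambda m: m.clamp(min=0)
x = torch.randn(8, 8, dtype=torch.cfloat)
y = A(x) + 0.1 * torch.randn(8, 8, dtype=torch.cfloat)
x = pnp_magnitude_step(x, y, A, At, D)
```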
arXiv Detail & Related papers (2023-12-26T12:25:09Z)
- Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
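A minimal sketch of the hypernetwork idea, assuming a conditioning code derived from the input image generates the weights of a small SDF MLP; the dimensions and the two-layer SDF net are deliberate simplifications, not the paper's architecture.

```python
# Hypothetical hypernetwork sketch: an image code z generates the weights
# of a tiny SDF network evaluated at 3D query points.
import torch
import torch.nn as nn

class HyperSDF(nn.Module):
    def __init__(self, z_dim=128, hidden=64):
        super().__init__()
        self.hidden = hidden
        # Hypernetwork: maps the image code to all SDF-MLP parameters.
        n_params = 3 * hidden + hidden + hidden + 1   # W1, b1, W2, b2
        self.hyper = nn.Linear(z_dim, n_params)

    def forward(self, z, pts):                        # z: (z_dim,), pts: (N, 3)
        h = self.hidden
        p = self.hyper(z)
        W1 = p[: 3 * h].view(h, 3)
        b1 = p[3 * h : 4 * h]
        W2 = p[4 * h : 5 * h].view(1, h)
        b2 = p[5 * h :]
        # Generated two-layer MLP: signed distance for each query point.
        return torch.relu(pts @ W1.T + b1) @ W2.T + b2

sdf = HyperSDF()
dist = sdf(torch.randn(128), torch.rand(1000, 3))     # (1000, 1)
```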
arXiv Detail & Related papers (2023-12-24T08:42:37Z)
- Explicit Correspondence Matching for Generalizable Neural Radiance Fields [49.49773108695526]
We present a new NeRF method that is able to generalize to new unseen scenarios and perform novel view synthesis with as few as two source views.
The explicit correspondence matching is quantified with the cosine similarity between image features sampled at the 2D projections of a 3D point on different views.
Our method achieves state-of-the-art results on different evaluation settings, with the experiments showing a strong correlation between our learned cosine feature similarity and volume density.
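A hedged sketch of the matching term: features from two views, sampled at the 2D projections of the same 3D points, are compared with cosine similarity. The camera projection machinery is stubbed out with precomputed pixel coordinates.

```python
# Hypothetical sketch: cosine similarity between features of two views
# sampled at the 2D projections of shared 3D points.
import torch
import torch.nn.functional as F

def matching_score(feat_a, feat_b, uv_a, uv_b):
    """feat_*: (1, C, H, W) feature maps; uv_*: (1, N, 2) projections in [-1, 1]."""
    # grid_sample expects a (1, N, 1, 2) grid; output is (1, C, N, 1).
    fa = F.grid_sample(feat_a, uv_a.unsqueeze(2), align_corners=True).squeeze(-1)
    fb = F.grid_sample(feat_b, uv_b.unsqueeze(2), align_corners=True).squeeze(-1)
    # Cosine similarity per 3D point, along the channel dimension.
    return F.cosine_similarity(fa, fb, dim=1)        # (1, N)

feat_a, feat_b = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
uv = torch.rand(1, 100, 2) * 2 - 1
sim = matching_score(feat_a, feat_b, uv, uv)         # higher => likely surface
```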
arXiv Detail & Related papers (2023-04-24T17:46:01Z)
- Transformer-based Image Generation from Scene Graphs [11.443097632746763]
Graph-structured scene descriptions can be efficiently used in generative models to control the composition of the generated image.
Previous approaches are based on the combination of graph convolutional networks and adversarial methods for layout prediction and image generation.
We show how employing multi-head attention to encode the graph information can improve the quality of the sampled data.
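A minimal sketch of encoding scene-graph information with multi-head self-attention, assuming node (object) and edge (relationship) embeddings are concatenated into one token sequence; the vocabulary sizes are placeholders and the image generator that consumes the encoding is omitted.

```python
# Hypothetical sketch: multi-head self-attention over scene-graph tokens.
import torch
import torch.nn as nn

dim, heads = 256, 8
obj_emb = nn.Embedding(200, dim)      # assumed object-category vocabulary
rel_emb = nn.Embedding(50, dim)       # assumed relationship vocabulary
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=4
)

# Toy graph: 3 objects, 2 relationships, flattened into one token sequence.
objs = obj_emb(torch.tensor([[5, 17, 42]]))          # (1, 3, dim)
rels = rel_emb(torch.tensor([[3, 7]]))               # (1, 2, dim)
tokens = torch.cat([objs, rels], dim=1)              # (1, 5, dim)
graph_code = encoder(tokens)                         # attention mixes graph info
# graph_code would then condition an image decoder (not shown).
```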
arXiv Detail & Related papers (2023-03-08T14:54:51Z)
- Extreme Rotation Estimation using Dense Correlation Volumes [73.35119461422153]
We present a technique for estimating the relative 3D rotation of an RGB image pair in an extreme setting.
We observe that, even when images do not overlap, there may be rich hidden cues as to their geometric relationship.
We propose a network design that can automatically learn such implicit cues by comparing all pairs of points between the two input images.
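A minimal sketch of the all-pairs comparison: a dot product between every feature of image A and every feature of image B yields a 4D correlation volume. Shapes and the scaling factor are illustrative.

```python
# Hypothetical sketch of a 4D correlation volume over all point pairs.
import torch

fa = torch.randn(1, 128, 16, 16)   # features of image A: (B, C, H, W)
fb = torch.randn(1, 128, 16, 16)   # features of image B

# Dot product between every location (i, j) in A and every (k, l) in B.
corr = torch.einsum("bcij,bckl->bijkl", fa, fb)    # (B, H, W, H, W)
corr = corr / fa.shape[1] ** 0.5                   # scale by sqrt(C)

# Each 4D entry scores how well two locations match across the pair;
# a downstream network can pool this volume into a rotation estimate.
print(corr.shape)   # torch.Size([1, 16, 16, 16, 16])
```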
arXiv Detail & Related papers (2021-04-28T02:00:04Z)
- Displacement-Invariant Cost Computation for Efficient Stereo Matching [122.94051630000934]
Deep learning methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy.
But their inference time is typically slow, on the order of seconds for a pair of 540p images.
We propose a displacement-invariant cost module to compute the matching costs without needing a 4D feature volume.
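A hedged sketch of a displacement-invariant cost: the same 2D matching network is applied at every candidate disparity, so no 4D feature volume is ever materialized. The shift-and-concatenate scheme and the tiny conv net are assumptions for illustration.

```python
# Hypothetical sketch: per-disparity 2D cost computation with shared weights.
import torch
import torch.nn as nn

match_net = nn.Sequential(              # same 2D network reused at every disparity
    nn.Conv2d(2 * 32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)

def cost_volume(feat_l, feat_r, max_disp=8):
    """feat_*: (B, C, H, W). Returns per-pixel costs (B, max_disp, H, W)."""
    costs = []
    for d in range(max_disp):
        shifted = torch.roll(feat_r, shifts=d, dims=3)   # crude disparity shift
        pair = torch.cat([feat_l, shifted], dim=1)
        costs.append(match_net(pair))    # purely 2D: no 4D volume is built
    return torch.cat(costs, dim=1)

costs = cost_volume(torch.randn(1, 32, 24, 48), torch.randn(1, 32, 24, 48))
```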
arXiv Detail & Related papers (2020-12-01T23:58:16Z)
- Fast Distance-based Anomaly Detection in Images Using an Inception-like Autoencoder [16.157879279661362]
A convolutional autoencoder (CAE) is trained to extract a low-dimensional representation of the images.
We employ a distance-based anomaly detector in the low-dimensional space of the learned image representation.
We find that our approach results in improved predictive performance.
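A minimal sketch of the detector stage, assuming encodings produced by a trained CAE: the anomaly score of a test image is its mean distance to the k nearest training encodings. The value of k, the Euclidean distance, and the threshold are illustrative choices.

```python
# Hypothetical sketch: k-NN distance anomaly scores in a learned latent space.
import torch

def anomaly_scores(train_z, test_z, k=5):
    """train_z: (N, D) encodings of normal images; test_z: (M, D)."""
    d = torch.cdist(test_z, train_z)            # (M, N) pairwise Euclidean
    knn, _ = d.topk(k, dim=1, largest=False)    # k smallest distances per item
    return knn.mean(dim=1)                      # higher score => more anomalous

train_z = torch.randn(1000, 32)   # encodings from the trained autoencoder
test_z = torch.randn(10, 32)
scores = anomaly_scores(train_z, test_z)
flagged = scores > scores.mean() + 2 * scores.std()   # simple threshold
```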
arXiv Detail & Related papers (2020-03-12T16:10:53Z)