Self-Supervised Spatial Correspondence Across Modalities
- URL: http://arxiv.org/abs/2506.03148v1
- Date: Tue, 03 Jun 2025 17:59:45 GMT
- Title: Self-Supervised Spatial Correspondence Across Modalities
- Authors: Ayush Shrivastava, Andrew Owens
- Abstract summary: We present a method for finding cross-modal space-time correspondences. Given two images, such as an RGB image and a depth map, our model identifies which pairs of pixels correspond to the same physical points in the scene.
- Score: 17.50529887238381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method for finding cross-modal space-time correspondences. Given two images from different visual modalities, such as an RGB image and a depth map, our model identifies which pairs of pixels correspond to the same physical points in the scene. To solve this problem, we extend the contrastive random walk framework to simultaneously learn cycle-consistent feature representations for both cross-modal and intra-modal matching. The resulting model is simple and has no explicit photo-consistency assumptions. It can be trained entirely using unlabeled data, without the need for any spatially aligned multimodal image pairs. We evaluate our method on both geometric and semantic correspondence tasks. For geometric matching, we consider challenging tasks such as RGB-to-depth and RGB-to-thermal matching (and vice versa); for semantic matching, we evaluate on photo-sketch and cross-style image alignment. Our method achieves strong performance across all benchmarks.
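The cycle-consistency idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal pure-Python version of a contrastive random walk style loss, assuming each image has already been reduced to a list of per-patch feature vectors. The names `transition`, `cycle_loss`, the toy features, and the temperature value are all hypothetical choices made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def transition(src, dst, temperature=0.07):
    # Row-stochastic walk matrix: softmax over cosine similarities
    # from every source node to every destination node.
    src = [l2_normalize(v) for v in src]
    dst = [l2_normalize(v) for v in dst]
    rows = []
    for s in src:
        logits = [sum(a * b for a, b in zip(s, d)) / temperature for d in dst]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows

def matmul(p, q):
    # Plain matrix product of two nested-list matrices.
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def cycle_loss(feats_a, feats_b, temperature=0.07):
    # Walk A -> B -> A and penalize walkers that do not return to their
    # starting node (negative log of the round-trip diagonal).
    p_ab = transition(feats_a, feats_b, temperature)
    p_ba = transition(feats_b, feats_a, temperature)
    round_trip = matmul(p_ab, p_ba)
    n = len(round_trip)
    return -sum(math.log(round_trip[i][i] + 1e-12) for i in range(n)) / n

# Toy features (assumed): three patches per modality, three dimensions each.
rgb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
depth_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
depth_ambiguous = [[1.0, 1.0, 1.0]] * 3  # every patch looks the same

loss_aligned = cycle_loss(rgb, depth_aligned)
loss_ambiguous = cycle_loss(rgb, depth_ambiguous)
print(loss_aligned, loss_ambiguous)
```

In this toy setup the loss is low when each patch in one modality has a distinct best match in the other, and it rises to log(3) when the second modality's features are indistinguishable, since the walker then returns to its start only by chance. No pixel-level labels or aligned pairs appear in the loss itself, which is consistent with the abstract's claim that training needs only unlabeled data.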
Related papers
- Semantic RGB-D Image Synthesis [22.137419841504908]
We introduce semantic RGB-D image synthesis to address this problem.
Current approaches, however, are uni-modal and cannot cope with multi-modal data.
We propose a generator for multi-modal data that separates modal-independent information of the semantic layout from the modal-dependent information.
arXiv Detail & Related papers (2023-08-22T11:16:24Z)
- Clothes Grasping and Unfolding Based on RGB-D Semantic Segmentation [21.950751953721817]
We propose a novel Bi-directional Fractal Cross Fusion Network (BiFCNet) for semantic segmentation.
We use RGB images with rich color features as input to our network in which the Fractal Cross Fusion module fuses RGB and depth data.
To reduce the cost of real data collection, we propose a data augmentation method based on an adversarial strategy.
arXiv Detail & Related papers (2023-05-05T03:21:55Z)
- Explicit Correspondence Matching for Generalizable Neural Radiance Fields [49.49773108695526]
We present a new NeRF method that is able to generalize to new unseen scenarios and perform novel view synthesis with as few as two source views.
The explicit correspondence matching is quantified with the cosine similarity between image features sampled at the 2D projections of a 3D point on different views.
Our method achieves state-of-the-art results on different evaluation settings, with the experiments showing a strong correlation between our learned cosine feature similarity and volume density.
arXiv Detail & Related papers (2023-04-24T17:46:01Z)
- A Geometrically Constrained Point Matching based on View-invariant Cross-ratios, and Homography [2.050924050557755]
A geometrically constrained algorithm is proposed to verify the correctness of initially matched SIFT keypoints based on view-invariant cross-ratios (CRs).
By randomly forming pentagons from these keypoints and matching their shape and location among images with CRs, robust planar region estimation can be achieved efficiently.
Experimental results show that satisfactory results can be obtained for various scenes with single as well as multiple planar regions.
arXiv Detail & Related papers (2022-11-06T01:55:35Z)
- RGB-Multispectral Matching: Dataset, Learning Methodology, Evaluation [49.28588927121722]
We address the problem of registering synchronized color (RGB) and multi-spectral (MS) images with very different resolutions by solving for stereo matching correspondences.
We introduce a novel RGB-MS dataset framing 13 different scenes in indoor environments and providing a total of 34 image pairs annotated with semi-dense, high-resolution ground-truth labels.
To tackle the task, we propose a deep learning architecture trained in a self-supervised manner by exploiting a further RGB camera.
arXiv Detail & Related papers (2022-06-14T17:59:59Z)
- Semantic-Sparse Colorization Network for Deep Exemplar-based Colorization [23.301799487207035]
Exemplar-based colorization approaches rely on a reference image to provide plausible colors for a target gray-scale image.
We propose Semantic-Sparse Colorization Network (SSCN) to transfer both the global image style and semantic-related colors to the gray-scale image.
Our network can perfectly balance the global and local colors while alleviating the ambiguous matching problem.
arXiv Detail & Related papers (2021-12-02T15:35:10Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Extreme Rotation Estimation using Dense Correlation Volumes [73.35119461422153]
We present a technique for estimating the relative 3D rotation of an RGB image pair in an extreme setting.
We observe that, even when images do not overlap, there may be rich hidden cues as to their geometric relationship.
We propose a network design that can automatically learn such implicit cues by comparing all pairs of points between the two input images.
arXiv Detail & Related papers (2021-04-28T02:00:04Z)
- A Similarity Inference Metric for RGB-Infrared Cross-Modality Person Re-identification [66.49212581685127]
Cross-modality person re-identification (re-ID) is a challenging task due to the large discrepancy between IR and RGB modalities.
Existing methods typically address this challenge by aligning feature distributions or image styles across modalities.
This paper presents a novel similarity inference metric (SIM) that exploits the intra-modality sample similarities to circumvent the cross-modality discrepancy.
arXiv Detail & Related papers (2020-07-03T05:28:13Z)
- RANSAC-Flow: generic two-stage image alignment [53.11926395028508]
We show that a simple unsupervised approach performs surprisingly well across a range of tasks.
Despite its simplicity, our method shows competitive results on a range of tasks and datasets.
arXiv Detail & Related papers (2020-04-03T12:37:58Z)