S2-Net: Self-supervision Guided Feature Representation Learning for
Cross-Modality Images
- URL: http://arxiv.org/abs/2203.14581v1
- Date: Mon, 28 Mar 2022 08:47:49 GMT
- Title: S2-Net: Self-supervision Guided Feature Representation Learning for
Cross-Modality Images
- Authors: Shasha Mei
- Abstract summary: Due to the large appearance differences between cross-modality image pairs, existing methods often fail to make the feature representations of correspondences sufficiently close.
In this letter, we design a cross-modality feature representation learning network, S2-Net, which is based on the recently successful detect-and-describe pipeline.
We introduce self-supervised learning with a well-designed loss function to guide the training without discarding the original advantages.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Combining the respective advantages of cross-modality images can
compensate for the information missing from a single modality, which has
attracted increasing research attention to multi-modal image matching tasks.
However, because of the large appearance differences between cross-modality
image pairs, existing methods often fail to bring the feature representations
of correspondences sufficiently close together. In this letter, we design a
cross-modality feature
representation learning network, S2-Net, which is based on the recently
successful detect-and-describe pipeline, originally proposed for visible images
but adapted to work with cross-modality image pairs. To address the
optimization difficulties that follow from this adaptation, we introduce
self-supervised learning
with a well-designed loss function to guide the training without discarding the
original advantages. This novel strategy simulates image pairs in the same
modality, which also provides useful guidance for the training of
cross-modality images. Notably, it requires no additional data, yet it
significantly improves performance and is applicable to any method built on
the detect-and-describe pipeline. Extensive experiments are conducted to
evaluate the performance of
the strategy we proposed, compared to both handcrafted and deep learning-based
methods. Results show that our elegant formulation, which jointly optimizes
supervised and self-supervised learning, outperforms state-of-the-art methods
on the RoadScene and RGB-NIR datasets.
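The combined optimization described in the abstract can be sketched in a few lines of PyTorch. The sketch below is illustrative only, not the authors' implementation: the descriptor network `net`, the photometric `augment` transform, the triplet margin, and the weight `lambda_ss` are all assumptions. It assumes the network returns L2-normalized descriptors for a fixed set of keypoint correspondences, so an image paired with a photometrically augmented copy of itself forms a simulated same-modality pair whose correspondences are known for free.

```python
import torch
import torch.nn.functional as F

def descriptor_loss(desc_a, desc_b, margin=1.0):
    # desc_a, desc_b: (N, D) L2-normalized descriptors of N correspondences.
    # Pull matched descriptors together; push each anchor away from its
    # hardest in-batch non-match (triplet loss with hardest-negative mining).
    dist = torch.cdist(desc_a, desc_b)            # (N, N) pairwise distances
    pos = dist.diag()                             # distances of matched pairs
    neg = (dist + 1e5 * torch.eye(len(dist), device=dist.device)).min(1).values
    return F.relu(margin + pos - neg).mean()

def s2net_step(net, img_vis, img_ir, augment, lambda_ss=1.0):
    # Supervised branch: a real cross-modality pair (e.g. visible/infrared)
    # with ground-truth correspondences.
    loss_sup = descriptor_loss(net(img_vis), net(img_ir))

    # Self-supervised branch: simulate same-modality pairs by pairing each
    # image with a photometrically augmented copy of itself; keypoint
    # locations are unchanged, so correspondences come at no labeling cost.
    loss_ss = descriptor_loss(net(img_vis), net(augment(img_vis))) + \
              descriptor_loss(net(img_ir), net(augment(img_ir)))

    return loss_sup + lambda_ss * loss_ss
```

One plausible reading of why this eases training: the simulated same-modality pairs give the optimizer easy positives with a small appearance gap, anchoring the descriptor space while the harder cross-modality pairs are aligned, all without any additional data.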
Related papers
- Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment and Multi-Scale Token Recycling [11.129453244307369]
Fine-grained sketch-based image retrieval (FG-SBIR) aims to minimize the distance between sketches and corresponding images in the embedding space.
We propose an effective approach to narrow the gap between the two domains.
It mainly facilitates unified mutual information sharing both within and across samples.
arXiv Detail & Related papers (2024-06-17T13:49:12Z)
- Symmetrical Bidirectional Knowledge Alignment for Zero-Shot Sketch-Based Image Retrieval [69.46139774646308]
This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR)
It aims to use sketches from unseen categories as queries to match the images of the same category.
We propose a novel Symmetrical Bidirectional Knowledge Alignment (SBKA) method for zero-shot sketch-based image retrieval.
arXiv Detail & Related papers (2023-12-16T04:50:34Z)
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features integrating the bimodal information.
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
- Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Key to effective video representations is cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z)
- Real-World Image Super-Resolution by Exclusionary Dual-Learning [98.36096041099906]
Real-world image super-resolution is a practical image restoration problem that aims to obtain high-quality images from in-the-wild input.
Deep learning-based methods have achieved promising restoration quality on real-world image super-resolution datasets.
We propose Real-World image Super-Resolution by Exclusionary Dual-Learning (RWSR-EDL) to address the feature diversity in perceptual- and L1-based cooperative learning.
arXiv Detail & Related papers (2022-06-06T13:28:15Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance while being 10,800X faster at inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object in an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net)
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between natural language expressions and images.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Multimodal Contrastive Training for Visual Representation Learning [45.94662252627284]
We develop an approach to learning visual representations that embraces multimodal data.
Our method exploits intrinsic data properties within each modality and semantic information from cross-modal correlation simultaneously.
By including multimodal training in a unified framework, our method can learn more powerful and generic visual features.
arXiv Detail & Related papers (2021-04-26T19:23:36Z)
- Learning Deformable Image Registration from Optimization: Perspective, Modules, Bilevel Training and Beyond [62.730497582218284]
We develop a new deep learning based framework to optimize a diffeomorphic model via multi-scale propagation.
We conduct two groups of image registration experiments on 3D volume datasets including image-to-atlas registration on brain MRI data and image-to-image registration on liver CT data.
arXiv Detail & Related papers (2020-04-30T03:23:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.