Related papers: Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval

SMR-Net: Robot Snap Detection Based on Multi-Scale Features and Self-Attention Network [0.0]
Traditional visual methods suffer from poor robustness and large localization errors when handling complex scenarios. This paper proposes SMR-Net, a self-attention-based multi-scale object detection algorithm. Experimental results on Type A and Type B snap datasets show SMR-Net outperforms traditional Faster R-CNN significantly.
arXiv Detail & Related papers (2026-03-01T10:28:01Z)
Dual-domain Adaptation Networks for Realistic Image Super-resolution [81.34345637776408]
Realistic image super-resolution (SR) focuses on transforming real-world low-resolution (LR) images into high-resolution (HR) ones. Current methods struggle with limited real-world LR-HR data, impacting the learning of basic image features. We introduce a novel approach, which is able to efficiently adapt pre-trained image SR models from simulated to real-world datasets.
arXiv Detail & Related papers (2025-11-21T12:57:23Z)
Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection [56.88160531995454]
Existing RGB-Event detection methods process the low-information regions of both modalities uniformly during feature extraction and fusion. We propose FocusMamba, which performs adaptive collaborative sparsification of multimodal features. Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that the proposed method achieves superior performance in both accuracy and efficiency.
arXiv Detail & Related papers (2025-09-04T04:18:46Z)
RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization [50.75654397516163]
We propose RelayFormer, a unified framework that adapts to varying resolutions and modalities. RelayFormer partitions inputs into fixed-size sub-images and introduces Global-Local Relay (GLR) tokens. This enables efficient exchange of global cues, such as semantic or temporal consistency, while preserving fine-grained manipulation artifacts.
arXiv Detail & Related papers (2025-08-13T03:35:28Z)
SAFE: Self-Adjustment Federated Learning Framework for Remote Sensing Collaborative Perception [12.303730216612877]
Existing distributed remote sensing models often rely on centralized training, resulting in data leakage, communication overhead, and reduced accuracy. We propose the Self-Adjustment FEderated Learning framework to enhance collaborative sensing in remote sensing scenarios.
arXiv Detail & Related papers (2025-03-25T06:39:34Z)
Cross-Modal Mapping: Mitigating the Modality Gap for Few-Shot Image Classification [13.238769012534922]
We propose a novel Cross-Modal Mapping (CMM) method for few-shot image classification. CMM aligns image features with the text feature space through linear transformation. It improves the average Top-1 accuracy by 1.06% on 11 benchmark datasets.
arXiv Detail & Related papers (2024-12-28T10:40:21Z)
Enhancing Scene Coordinate Regression with Efficient Keypoint Detection and Sequential Information [26.934946734751442]
We propose an efficient and accurate Scene Coordinate Regression (SCR) system. Compared to existing SCR methods, we propose a unified architecture for both scene encoding and salient keypoint detection. Comprehensive experiments conducted across indoor and outdoor datasets demonstrate that the proposed system outperforms state-of-the-art (SOTA) SCR methods.
arXiv Detail & Related papers (2024-12-09T13:39:18Z)
Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution [49.902047563260496]
We make the first attempt to integrate the Vision State Space Model (Mamba) for remote sensing image (RSI) super-resolution. To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR. Our FMSR features a multi-level fusion architecture equipped with the Frequency Selection Module (FSM), Vision State Space Module (VSSM), and Hybrid Gate Module (HGM).
arXiv Detail & Related papers (2024-05-08T11:09:24Z)
AMMUNet: Multi-Scale Attention Map Merging for Remote Sensing Image Segmentation [4.618389486337933]
We propose AMMUNet, a UNet-based framework that employs multi-scale attention map merging. The proposed AMMM effectively combines multi-scale attention maps into a unified representation using a fixed mask template. We show that our approach achieves remarkable mean intersection over union (mIoU) scores of 75.48% on the Vaihingen dataset and an exceptional 77.90% on the Potsdam dataset.
arXiv Detail & Related papers (2024-04-20T15:23:15Z)
ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions. Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks. We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z)
FuseFormer: A Transformer for Visual and Thermal Image Fusion [3.6064695344878093]
We propose a novel methodology for the image fusion problem that mitigates the limitations associated with using classical evaluation metrics as loss functions. Our approach integrates a transformer-based multi-scale fusion strategy that adeptly addresses local and global context information. Our proposed method, along with the novel loss function definition, demonstrates superior performance compared to other competitive fusion algorithms.
arXiv Detail & Related papers (2024-02-01T19:40:39Z)
Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network for Remote Sensing Image Super-Resolution [13.894645293832044]
Transformer-based models have shown competitive performance in remote sensing image super-resolution (RSISR). We propose a novel transformer architecture called Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network (SPIFFNet) for RSISR. Our proposed model effectively enhances global cognition and understanding of the entire image, facilitating efficient integration of features across stages.
arXiv Detail & Related papers (2023-07-06T13:19:06Z)
Recursive Generalization Transformer for Image Super-Resolution [108.67898547357127]
We propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images. We combine the RG-SA with local self-attention to enhance the exploitation of the global context. Our RGT outperforms recent state-of-the-art methods quantitatively and qualitatively.
arXiv Detail & Related papers (2023-03-11T10:44:44Z)
Magic ELF: Image Deraining Meets Association Learning and Transformer [63.761812092934576]
This paper aims to unify CNN and Transformer to take advantage of their learning merits for image deraining. A novel multi-input attention module (MAM) is proposed to associate rain removal and background recovery. Our proposed method (dubbed as ELF) outperforms the state-of-the-art approach (MPRNet) by 0.25 dB on average.
arXiv Detail & Related papers (2022-07-21T12:50:54Z)
Cross-modal Local Shortest Path and Global Enhancement for Visible-Thermal Person Re-Identification [2.294635424666456]
We propose the Cross-modal Local Shortest Path and Global Enhancement (CM-LSP-GE) modules, a two-stream network based on joint learning of local and global features. The experimental results on two typical datasets show that our model is clearly superior to most state-of-the-art methods.
arXiv Detail & Related papers (2022-06-09T10:27:22Z)
Real-World Image Super-Resolution by Exclusionary Dual-Learning [98.36096041099906]
Real-world image super-resolution is a practical image restoration problem that aims to obtain high-quality images from in-the-wild input. Deep learning-based methods have achieved promising restoration quality on real-world image super-resolution datasets. We propose Real-World image Super-Resolution by Exclusionary Dual-Learning (RWSR-EDL) to address the feature diversity in perceptual- and L1-based cooperative learning.
arXiv Detail & Related papers (2022-06-06T13:28:15Z)
Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information [15.32353270625554]
Cross-modal remote sensing text-image retrieval (RSCTIR) has recently become an urgent research hotspot due to its ability to enable fast and flexible information extraction on remote sensing (RS) images. We first propose a novel RSCTIR framework based on global and local information (GaLR), and design a multi-level information dynamic fusion (MIDF) module to effectively integrate features of different levels. Experiments on public datasets strongly demonstrate the state-of-the-art performance of GaLR methods on the RSCTIR task.
arXiv Detail & Related papers (2022-04-21T03:18:09Z)
Dual-Flow Transformation Network for Deformable Image Registration with Region Consistency Constraint [95.30864269428808]
Current deep learning (DL)-based image registration approaches learn the spatial transformation from one image to another by leveraging a convolutional neural network. We present a novel dual-flow transformation network with region consistency constraint which maximizes the similarity of ROIs within a pair of images. Experiments on four public 3D MRI datasets show that the proposed method achieves the best registration performance in accuracy and generalization.
arXiv Detail & Related papers (2021-12-04T05:30:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.