Related papers: Adaptive Agent Selection and Interaction Network for Image-to-point cloud Registration

Adaptive Agent Selection and Interaction Network for Image-to-point cloud Registration

URL: http://arxiv.org/abs/2511.05965v1
Date: Sat, 08 Nov 2025 10:50:43 GMT
Title: Adaptive Agent Selection and Interaction Network for Image-to-point cloud Registration
Authors: Zhixin Cheng, Xiaotian Yin, Jiacheng Deng, Bohao Liao, Yujia Chen, Xu Zhou, Baoqun Yin, Tianzhu Zhang,
Abstract summary: We propose a novel cross-modal registration framework composed of two key modules: the Iterative Agents Selection (IAS) module and the Reliable Agents Interaction (RAI) module.<n>IAS enhances structural feature awareness with phase maps and employs reinforcement learning principles to efficiently select reliable agents.<n>RAI then leverages these selected agents to guide cross-modal interactions, effectively reducing mismatches and improving overall robustness.
Score: 28.794550758318664
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Typical detection-free methods for image-to-point cloud registration leverage transformer-based architectures to aggregate cross-modal features and establish correspondences. However, they often struggle under challenging conditions, where noise disrupts similarity computation and leads to incorrect correspondences. Moreover, without dedicated designs, it remains difficult to effectively select informative and correlated representations across modalities, thereby limiting the robustness and accuracy of registration. To address these challenges, we propose a novel cross-modal registration framework composed of two key modules: the Iterative Agents Selection (IAS) module and the Reliable Agents Interaction (RAI) module. IAS enhances structural feature awareness with phase maps and employs reinforcement learning principles to efficiently select reliable agents. RAI then leverages these selected agents to guide cross-modal interactions, effectively reducing mismatches and improving overall robustness. Extensive experiments on the RGB-D Scenes v2 and 7-Scenes benchmarks demonstrate that our method consistently achieves state-of-the-art performance.

Related papers

STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification [14.549172375231729]
We propose STMI, a novel multi-modal learning framework consisting of three key components.<n>We demonstrate the effectiveness and robustness of our proposed STMI framework in multi-modal ReID scenarios.
arXiv Detail & Related papers (2026-02-28T15:07:10Z)
Source-Free Object Detection with Detection Transformer [59.33653163035064]
Source-Free Object Detection (SFOD) enables knowledge transfer from a source domain to an unsupervised target domain for object detection without access to source data.<n>Most existing SFOD approaches are either confined to conventional object detection (OD) models like Faster R-CNN or designed as general solutions without tailored adaptations for novel OD architectures, especially Detection Transformer (DETR)<n>In this paper, we introduce Feature Reweighting ANd Contrastive Learning NetworK (FRANCK), a novel SFOD framework specifically designed to perform query-centric feature enhancement for DETRs.
arXiv Detail & Related papers (2025-10-13T07:35:04Z)
DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image.<n> Vision-Language Pre-training models offer a strong open-vocabulary foundation, but struggle with fine-grained localization under weak supervision.<n>We propose the Dual Adaptive Refinement Transfer (DART) framework to overcome these limitations.
arXiv Detail & Related papers (2025-08-07T17:22:33Z)
Dual-Granularity Cross-Modal Identity Association for Weakly-Supervised Text-to-Person Image Matching [7.1469465755934785]
Weakly supervised text-to-person image matching is a crucial approach to reducing models' reliance on large-scale manually labeled samples.<n>We propose a dual-granularity identity association mechanism to predict complex one-to-many identity relationships.<n> Experimental results demonstrate that the proposed method substantially boosts cross-modal matching accuracy.
arXiv Detail & Related papers (2025-07-09T10:59:13Z)
SCING:Towards More Efficient and Robust Person Re-Identification through Selective Cross-modal Prompt Tuning [17.104905795008555]
We propose a framework named Selective Cross-modal Prompt Tuning (SCING) that enhances cross-modal alignment and robustness against real-worlds.<n>Our framework eliminates heavy adapters while maintaining efficient inference, achieving an optimal trade-off between performance and computational overhead.
arXiv Detail & Related papers (2025-07-01T07:21:31Z)
Diffusion Augmented Retrieval: A Training-Free Approach to Interactive Text-to-Image Retrieval [7.439049772394586]
Diffusion Augmented Retrieval (DAR) is a framework that generates multiple intermediate representations via dialogue refinements and DMs.<n>DAR results on par with finetuned I-TIR models, yet without incurring their tuning overhead.
arXiv Detail & Related papers (2025-01-26T03:29:18Z)
CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
Cross-modal retrieval aims to search for instances, which are semantically related to the query through the interaction of different modal data.<n>Traditional solutions utilize a single-tower or dual-tower framework to explicitly compute the score between queries and candidates.<n>We propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling.
arXiv Detail & Related papers (2024-06-25T12:47:04Z)
Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module -- SAM-guidEd refinEment Module (SEEM) This light-weight plug-in module is specifically designed to leverage the attention mechanism for the generation of semantic-aware feature. We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
arXiv Detail & Related papers (2023-05-11T02:02:53Z)
Plug-and-Play Regulators for Image-Text Matching [76.28522712930668]
Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching. We develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations. Experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models.
arXiv Detail & Related papers (2023-03-23T15:42:05Z)
CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification [38.96033760300123]
Cross-modality transformer-based method (CMTR) for visible-infrared person re-identification task. We design the novel modality embeddings, which are fused with token embeddings to encode modalities' information. Our proposed CMTR model's performance significantly surpasses existing outstanding CNN-based methods.
arXiv Detail & Related papers (2021-10-18T03:12:59Z)
RGB-D Salient Object Detection with Cross-Modality Modulation and Selection [126.4462739820643]
We present an effective method to progressively integrate and refine the cross-modality complementarities for RGB-D salient object detection (SOD) The proposed network mainly solves two challenging issues: 1) how to effectively integrate the complementary information from RGB image and its corresponding depth map, and 2) how to adaptively select more saliency-related features.
arXiv Detail & Related papers (2020-07-14T14:22:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.