OptiCorNet: Optimizing Sequence-Based Context Correlation for Visual Place Recognition
- URL: http://arxiv.org/abs/2507.14477v1
- Date: Sat, 19 Jul 2025 04:29:43 GMT
- Title: OptiCorNet: Optimizing Sequence-Based Context Correlation for Visual Place Recognition
- Authors: Zhenyu Li, Tianyi Shang, Pengjie Xu, Ruirui Zhang, Fanchen Kong,
- Abstract summary: This paper presents OptiCorNet, a novel sequence modeling framework. It unifies spatial feature extraction and temporal differencing into a differentiable, end-to-end trainable module. Our approach outperforms state-of-the-art baselines under challenging seasonal and viewpoint variations.
- Score: 2.3093110834423616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Place Recognition (VPR) in dynamic and perceptually aliased environments remains a fundamental challenge for long-term localization. Existing deep learning-based solutions predominantly focus on single-frame embeddings, neglecting the temporal coherence present in image sequences. This paper presents OptiCorNet, a novel sequence modeling framework that unifies spatial feature extraction and temporal differencing into a differentiable, end-to-end trainable module. Central to our approach is a lightweight 1D convolutional encoder combined with a learnable differential temporal operator, termed Differentiable Sequence Delta (DSD), which jointly captures short-term spatial context and long-range temporal transitions. The DSD module models directional differences across sequences via a fixed-weight differencing kernel, followed by an LSTM-based refinement and optional residual projection, yielding compact, discriminative descriptors robust to viewpoint and appearance shifts. To further enhance inter-class separability, we incorporate a quadruplet loss that optimizes both positive alignment and multi-negative divergence within each batch. Unlike prior VPR methods that treat temporal aggregation as post-processing, OptiCorNet learns sequence-level embeddings directly, enabling more effective end-to-end place recognition. Comprehensive evaluations on multiple public benchmarks demonstrate that our approach outperforms state-of-the-art baselines under challenging seasonal and viewpoint variations.
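Below is a minimal PyTorch sketch of the pipeline the abstract describes: a fixed-weight temporal differencing kernel, LSTM refinement, an optional residual projection, and a quadruplet loss. All class names, tensor shapes, margins, and layer sizes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiableSequenceDelta(nn.Module):
    """Sketch of a DSD-style block: a fixed (non-trainable) differencing kernel
    over the temporal axis, LSTM refinement, and an optional residual projection.
    Shapes and layer sizes are assumptions for illustration."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 512, residual: bool = True):
        super().__init__()
        # Lightweight 1D conv encoder over the temporal axis of per-frame features.
        self.encoder = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        # Fixed first-order differencing kernel (x[t] - x[t-1]), applied depthwise.
        diff = torch.tensor([-1.0, 1.0]).view(1, 1, 2).repeat(feat_dim, 1, 1)
        self.register_buffer("diff_kernel", diff)
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim) if residual else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, feat_dim) per-frame descriptors.
        h = self.encoder(x.transpose(1, 2))                       # (B, D, T)
        delta = F.conv1d(h, self.diff_kernel, groups=h.size(1))   # directional differences, (B, D, T-1)
        out, _ = self.lstm(delta.transpose(1, 2))                 # refine temporal transitions
        desc = out[:, -1]                                         # last hidden state as sequence descriptor
        if self.proj is not None:
            desc = self.proj(desc) + x.mean(dim=1)                # optional residual back to frame features
        return F.normalize(desc, dim=-1)

def quadruplet_loss(a, p, n1, n2, m1: float = 0.5, m2: float = 0.3):
    """Generic quadruplet loss: pull anchor-positive together while pushing
    two negatives apart. The exact formulation and margins in the paper may differ."""
    d = lambda u, v: (u - v).pow(2).sum(dim=-1)
    return (F.relu(d(a, p) - d(a, n1) + m1) + F.relu(d(a, p) - d(n1, n2) + m2)).mean()
```

The differencing kernel is registered as a buffer so it stays fixed during training, matching the abstract's description of a fixed-weight operator inside an otherwise end-to-end trainable module.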
Related papers
- DAMS: Dual-Branch Adaptive Multiscale Spatiotemporal Framework for Video Anomaly Detection [7.117824587276951]
This study offers a dual-path architecture called the Dual-Branch Adaptive Multiscale Spatiotemporal Framework (DAMS), which is based on multilevel feature decoupling and fusion. The main processing path integrates the Adaptive Multiscale Time Pyramid Network (AMTPN) with the Convolutional Block Attention Mechanism (CBAM).
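Since the summary names CBAM, here is a compact PyTorch sketch of the standard Convolutional Block Attention Module (channel attention followed by spatial attention), using common defaults rather than DAMS's exact configuration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Standard CBAM: channel attention from pooled statistics, then spatial
    attention from channel-wise pooling. Reduction ratio 16 is a common default."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: 7x7 conv over channel-wise mean and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```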
arXiv Detail & Related papers (2025-07-28T08:42:00Z)
- IPSeg: Image Posterior Mitigates Semantic Drift in Class-Incremental Segmentation [77.06177202334398]
We identify two critical challenges in CISS that contribute to semantic drift and degrade performance. First, we highlight the issue of separate optimization, where different parts of the model are optimized in distinct incremental stages. Second, we identify noisy semantics arising from inappropriate pseudo-labeling, which leads to sub-optimal results.
arXiv Detail & Related papers (2025-02-07T12:19:37Z)
- Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL). It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
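As background for the LoRA mechanism the method builds on, here is a minimal low-rank adaptation layer in PyTorch; the rank and scaling values are generic defaults, and this class is not MMD-LoRA itself.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA: freeze the pretrained weight W and learn a low-rank update
    B @ A scaled by alpha / r, so the layer computes x @ (W + (alpha/r) * B @ A)^T."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the pretrained weights frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```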
arXiv Detail & Related papers (2024-12-28T14:23:58Z)
- OED: Towards One-stage End-to-End Dynamic Scene Graph Generation [18.374354844446962]
Dynamic Scene Graph Generation (DSGG) focuses on identifying visual relationships within the spatial-temporal domain of videos.
We propose a one-stage end-to-end framework, termed OED, which streamlines the DSGG pipeline.
This framework reformulates the task as a set prediction problem and leverages pair-wise features to represent each subject-object pair within the scene graph.
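A hedged sketch of the pair-wise representation idea: given N learned queries, a DETR-style set predictor can form subject-object pair features by concatenating query embeddings and projecting them. The exact pairing scheme in OED may differ; the names below are illustrative.

```python
import torch
import torch.nn as nn

def pairwise_features(queries: torch.Tensor, proj: nn.Linear) -> torch.Tensor:
    """Build N*N subject-object pair features from N query embeddings.
    queries: (N, D); returns (N, N, proj.out_features)."""
    n, d = queries.shape
    subj = queries.unsqueeze(1).expand(n, n, d)  # subject embedding broadcast over objects
    obj = queries.unsqueeze(0).expand(n, n, d)   # object embedding broadcast over subjects
    return proj(torch.cat([subj, obj], dim=-1))

# Usage: 100 queries of width 256 projected to 256-d pair features.
q = torch.randn(100, 256)
proj = nn.Linear(512, 256)
pairs = pairwise_features(q, proj)  # (100, 100, 256)
```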
arXiv Detail & Related papers (2024-05-27T08:18:41Z)
- Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z)
- Long-Term Invariant Local Features via Implicit Cross-Domain Correspondences [79.21515035128832]
We conduct a thorough analysis of the performance of current state-of-the-art feature extraction networks under various domain changes.
We propose a novel data-centric method, Implicit Cross-Domain Correspondences (iCDC).
iCDC represents the same environment with multiple Neural Radiance Fields, each fitting the scene under individual visual domains.
arXiv Detail & Related papers (2023-11-06T18:53:01Z)
- FOCAL: Contrastive Learning for Multimodal Time-Series Sensing Signals in Factorized Orthogonal Latent Space [7.324708513042455]
This paper proposes a novel contrastive learning framework, called FOCAL, for extracting comprehensive features from multimodal time-series sensing signals.
It consistently outperforms the state-of-the-art baselines in downstream tasks by a clear margin.
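One way to realize a "factorized orthogonal latent space" is to split each modality embedding into shared and private parts and penalize their overlap. The sketch below shows such an orthogonality loss under assumed names; it is not FOCAL's exact objective.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(shared: torch.Tensor, private: torch.Tensor) -> torch.Tensor:
    """Drive the cosine similarity between shared and private embeddings toward
    zero, encouraging the two factors to encode complementary information.
    Shapes: (batch, dim)."""
    s = F.normalize(shared, dim=-1)
    p = F.normalize(private, dim=-1)
    return (s * p).sum(dim=-1).pow(2).mean()

# Usage: encoders (not shown) map each modality into shared/private factors,
# and this term is added to the contrastive objective with a small weight.
loss_orth = orthogonality_loss(torch.randn(32, 128), torch.randn(32, 128))
```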
arXiv Detail & Related papers (2023-10-30T22:55:29Z)
- Local-Global Temporal Difference Learning for Satellite Video Super-Resolution [53.03380679343968]
We propose to exploit the well-defined temporal difference for efficient and effective temporal compensation. To fully utilize the local and global temporal information within frames, we systematically model the short-term and long-term temporal discrepancies. Rigorous objective and subjective evaluations conducted across five mainstream video satellites demonstrate that our method performs favorably against state-of-the-art approaches.
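A minimal sketch of temporal-difference cues of the kind this summary describes: short-term differences between adjacent frames and a long-term difference against a distant reference. Variable names, the gap size, and the downstream compensation step are assumptions.

```python
import torch

def temporal_differences(frames: torch.Tensor, long_gap: int = 4):
    """frames: (T, C, H, W) aligned video frames.
    Returns short-term diffs (T-1, C, H, W) between adjacent frames and
    long-term diffs (T-long_gap, C, H, W) against a frame long_gap steps back.
    These difference maps can then condition a compensation/fusion network."""
    short_diff = frames[1:] - frames[:-1]
    long_diff = frames[long_gap:] - frames[:-long_gap]
    return short_diff, long_diff

short_diff, long_diff = temporal_differences(torch.randn(8, 3, 64, 64))
```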
arXiv Detail & Related papers (2023-04-10T07:04:40Z)
- RetiFluidNet: A Self-Adaptive and Multi-Attention Deep Convolutional Network for Retinal OCT Fluid Segmentation [3.57686754209902]
Quantification of retinal fluids is necessary for OCT-guided treatment management.
A new convolutional neural architecture named RetiFluidNet is proposed for multi-class retinal fluid segmentation.
The model benefits from hierarchical representation learning of textural, contextual, and edge features.
arXiv Detail & Related papers (2022-09-26T07:18:00Z)
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network [35.932447204088845]
We develop an effective feature alignment network tailored to video saliency prediction.
The network learns to align the features of the neighboring frames to the reference one in a coarse-to-fine manner.
The proposed model is trained end-to-end without any post-processing.
arXiv Detail & Related papers (2020-01-02T02:05:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.