GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers
- URL: http://arxiv.org/abs/2408.02840v1
- Date: Mon, 5 Aug 2024 21:29:33 GMT
- Title: GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers
- Authors: Manu S Pillai, Mamshad Nayeem Rizve, Mubarak Shah
- Abstract summary: Cross-view video geo-localization aims to derive GPS trajectories from street-view videos by aligning them with aerial-view images.
Current CVGL methods use camera and odometry data, typically absent in real-world scenarios.
We propose GAReT, a fully transformer-based method for CVGL that does not require camera and odometry data.
- Score: 53.80009458891537
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-view video geo-localization (CVGL) aims to derive GPS trajectories from street-view videos by aligning them with aerial-view images. Despite their promising performance, current CVGL methods face significant challenges. These methods use camera and odometry data, typically absent in real-world scenarios. They utilize multiple adjacent frames and various encoders for feature extraction, resulting in high computational costs. Moreover, these approaches independently predict each street-view frame's location, resulting in temporally inconsistent GPS trajectories. To address these challenges, in this work, we propose GAReT, a fully transformer-based method for CVGL that does not require camera and odometry data. We introduce GeoAdapter, a transformer-adapter module designed to efficiently aggregate image-level representations and adapt them for video inputs. Specifically, we train a transformer encoder on video frames and aerial images, then freeze the encoder to optimize the GeoAdapter module to obtain video-level representation. To address temporally inconsistent trajectories, we introduce TransRetriever, an encoder-decoder transformer model that predicts GPS locations of street-view frames by encoding top-k nearest neighbor predictions per frame and auto-regressively decoding the best neighbor based on the previous frame's predictions. Our method's effectiveness is validated through extensive experiments, demonstrating state-of-the-art performance on benchmark datasets. Our code is available at https://github.com/manupillai308/GAReT.
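As a rough illustration of the auto-regressive decoding idea behind TransRetriever, the sketch below replaces the learned encoder-decoder with a simple greedy rule: for each frame, pick the top-k candidate that balances retrieval similarity against the distance jumped from the previous frame's pick. The function name, the linear distance penalty, and dist_weight are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def greedy_trajectory_decode(candidates, scores, dist_weight=1.0):
    """Greedy stand-in for TransRetriever-style auto-regressive decoding.

    candidates: (T, K, 2) array of top-K candidate GPS points per frame.
    scores:     (T, K) retrieval similarities for those candidates.
    Picks, for each frame, the candidate that maximizes its own similarity
    minus a penalty for jumping far from the previous frame's pick.
    """
    T, K, _ = candidates.shape
    picks = np.empty((T, 2))
    prev = None
    for t in range(T):
        if prev is None:
            j = int(np.argmax(scores[t]))                        # first frame: best match wins
        else:
            jump = np.linalg.norm(candidates[t] - prev, axis=1)  # (K,) distances to previous pick
            j = int(np.argmax(scores[t] - dist_weight * jump))   # trade off similarity vs. jump
        prev = candidates[t, j]
        picks[t] = prev
    return picks

# Toy usage: 3 frames, top-2 candidates each.
cands = np.array([[[0.0, 0.0], [5.0, 5.0]],
                  [[0.1, 0.1], [5.1, 5.1]],
                  [[0.2, 0.2], [5.2, 5.2]]])
sims = np.array([[0.9, 0.2], [0.4, 0.5], [0.8, 0.3]])
print(greedy_trajectory_decode(cands, sims))
```

In the toy run, the second frame's highest-scoring candidate is rejected because it lies far from the first frame's pick; that is exactly the temporal consistency the paper targets with its learned decoder.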
Related papers
- Surrogate Modeling of Trajectory Map-matching in Urban Road Networks using Transformer Sequence-to-Sequence Model [1.3812010983144802]
This paper introduces a transformer-based encoder-decoder model that serves as a deep-learning surrogate for offline map-matching algorithms (a minimal sketch follows this entry).
The model is trained and evaluated using GPS traces collected in Manhattan, New York.
arXiv Detail & Related papers (2024-04-18T18:39:23Z)
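A surrogate of this kind can be prototyped directly with PyTorch's built-in nn.Transformer. The sketch below is a minimal, hedged reading of the summary above: noisy GPS points feed the encoder, and the decoder emits road-segment IDs auto-regressively. All dimensions, the vocabulary size, and the class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MapMatchSurrogate(nn.Module):
    """Illustrative seq2seq surrogate: noisy GPS points in, road-segment IDs out."""
    def __init__(self, n_segments=1000, d_model=128):
        super().__init__()
        self.gps_proj = nn.Linear(2, d_model)              # (lat, lon) -> d_model
        self.seg_emb = nn.Embedding(n_segments, d_model)   # decoder token embedding
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.head = nn.Linear(d_model, n_segments)         # per-step segment logits

    def forward(self, gps, prev_segments):
        # gps: (B, T_in, 2); prev_segments: (B, T_out) segment IDs seen so far
        src = self.gps_proj(gps)
        tgt = self.seg_emb(prev_segments)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)
        return self.head(out)                              # (B, T_out, n_segments)

model = MapMatchSurrogate()
logits = model(torch.randn(2, 20, 2), torch.randint(0, 1000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 1000])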
- CATS v2: Hybrid encoders for robust medical segmentation [12.194439938007672]
Convolutional Neural Networks (CNNs) have exhibited strong performance in medical image segmentation tasks.
However, due to the limited receptive field of convolution kernels, it is hard for CNNs to fully represent global information.
We propose CATS v2 with hybrid encoders, which better leverage both local and global information.
arXiv Detail & Related papers (2023-08-11T20:21:54Z)
- RNTrajRec: Road Network Enhanced Trajectory Recovery with Spatial-Temporal Transformer [15.350300338463969]
We propose a road network enhanced transformer-based framework, namely RNTrajRec, for trajectory recovery.
RNTrajRec first uses a graph model, namely GridGNN, to learn the embedding features of each road segment.
It then introduces a Sub-Graph Generation module to represent each GPS point as a sub-graph structure of the road network around the GPS point.
arXiv Detail & Related papers (2022-11-23T11:28:32Z)
- TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization [81.70547404891099]
CNN-based methods for cross-view image geo-localization fail to model global correlation.
We propose a pure transformer-based approach (TransGeo) to address these limitations.
TransGeo achieves state-of-the-art results on both urban and rural datasets.
arXiv Detail & Related papers (2022-03-31T21:19:41Z)
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is a self-supervised learning (SSL) framework for selecting video clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on the VGG-Sound and Kinetics-400 datasets with PreViTS (the momentum update MoCo relies on is sketched below).
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
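For reference, a MoCo encoder maintains a key encoder as an exponential moving average (EMA) of the query encoder. A minimal sketch of that standard update, with placeholder encoders standing in for the real networks:

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    """MoCo-style update: the key encoder is an EMA of the query encoder."""
    for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
        k.data.mul_(m).add_(q.data, alpha=1.0 - m)  # k <- m*k + (1-m)*q

enc_q = torch.nn.Linear(8, 8)               # stand-ins for the two encoders
enc_k = torch.nn.Linear(8, 8)
enc_k.load_state_dict(enc_q.state_dict())   # key encoder starts as a copy
momentum_update(enc_q, enc_k)               # called after each training step
```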
- On Pursuit of Designing Multi-modal Transformer for Video Grounding [35.25323276744999]
Video grounding aims to localize the temporal segment corresponding to a sentence query from an untrimmed video.
We propose a novel end-to-end multi-modal Transformer model, dubbed GTR. Specifically, GTR has two encoders for video and language encoding, and a cross-modal decoder for grounding prediction.
All three GTR variants achieve record-breaking performance on all datasets and metrics, with several times faster inference.
arXiv Detail & Related papers (2021-09-13T16:01:19Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder that drops the full attention implementation with its softmax weighting, keeping only the query-key similarity (sketched below).
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
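A hedged sketch of what such a softmax-free matching step could look like: raw query-key dot products are used directly as similarity scores rather than being normalized into attention weights. The projection layers and the max-mean pooling into a scalar score are our illustrative choices, not the paper's exact head.

```python
import torch
import torch.nn as nn

class QueryKeySimilarity(nn.Module):
    """Softmax-free matching step: attention reduced to raw query-key
    dot products, used directly as matching scores instead of as
    weights for aggregating values."""
    def __init__(self, dim=256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, feats_a, feats_b):
        # feats_a: (N, dim) patch features of image A; feats_b: (M, dim) of image B
        q = self.q_proj(feats_a)
        k = self.k_proj(feats_b)
        sim = q @ k.t() / q.size(-1) ** 0.5   # (N, M) raw similarities, no softmax
        return sim.max(dim=1).values.mean()   # pool into one matching score

scorer = QueryKeySimilarity()
score = scorer(torch.randn(49, 256), torch.randn(49, 256))
print(score.item())
```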
- TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation [77.09542018140823]
We propose a neural network approach with a graph transformer backbone, namely TransCamP, to address the camera relocalization problem.
TransCamP effectively fuses the image features, camera pose information and inter-frame relative camera motions into encoded graph attributes.
arXiv Detail & Related papers (2021-05-28T19:08:43Z)
- Learning Spatio-Temporal Transformer for Visual Tracking [108.11680070733598]
We present a new tracking architecture with an encoder-decoder transformer as the key component.
The method is end-to-end and needs no post-processing steps such as cosine windowing or bounding-box smoothing.
The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks while running at real-time speed, 6x faster than Siam R-CNN.
arXiv Detail & Related papers (2021-03-31T15:19:19Z)