GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space
- URL: http://arxiv.org/abs/2507.10473v2
- Date: Fri, 25 Jul 2025 21:08:55 GMT
- Title: GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space
- Authors: David G. Shatwell, Ishan Rajendrakumar Dave, Sirnam Swetha, Mubarak Shah
- Abstract summary: GT-Loc is a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues like brightness, hue, and shadow positioning, while seasonal changes and weather inform date estimation. However, these visual cues significantly depend on geographic context, closely linking timestamp prediction to geo-localization. To address this interdependence, we introduce GT-Loc, a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space. Recognizing the cyclical nature of time, instead of conventional contrastive learning with hard positives and negatives, we propose a temporal metric-learning objective providing soft targets by modeling pairwise time differences over a cyclical toroidal surface. We present new benchmarks demonstrating that our joint optimization surpasses previous time prediction methods, even those using the ground-truth geo-location as an input during inference. Additionally, our approach achieves competitive results on standard geo-localization tasks, and the unified embedding space facilitates compositional and text-based image retrieval.
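The abstract's key idea is that pairwise time differences live on a torus: hours wrap with period 24 and months with period 12, so December/January or 23:00/01:00 pairs are close, not far. The paper's exact objective is not given here; the following is a minimal, illustrative sketch of one plausible formulation, where each axis is normalized by its maximum cyclic gap and a Gaussian kernel turns distance into a soft similarity target (the function names and the temperature `tau` are assumptions, not the paper's notation).

```python
import math

def cyclic_diff(a, b, period):
    """Shortest distance between two points on a circle of the given period."""
    d = abs(a - b) % period
    return min(d, period - d)

def toroidal_distance(t1, t2):
    """Distance between two (hour, month) timestamps on a torus.

    Hours wrap with period 24, months with period 12; each axis is
    divided by its maximum cyclic gap so both contribute equally.
    """
    dh = cyclic_diff(t1[0], t2[0], 24) / 12.0  # max cyclic hour gap is 12
    dm = cyclic_diff(t1[1], t2[1], 12) / 6.0   # max cyclic month gap is 6
    return math.hypot(dh, dm)

def soft_target(t1, t2, tau=0.5):
    """Soft similarity in (0, 1]: 1 for identical times, decaying with distance."""
    return math.exp(-toroidal_distance(t1, t2) ** 2 / tau)
```

For example, 23:00 in December and 01:00 in January are only 2 hours and 1 month apart cyclically, so their soft target stays high, whereas a conventional contrastive loss would treat them as a hard negative pair.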
Related papers
- REPLAY: Modeling Time-Varying Temporal Regularities of Human Mobility for Location Prediction over Sparse Trajectories [7.493786214342181]
We propose REPLAY, a general RNN architecture learning to capture the time-varying temporal regularities for location prediction. Specifically, REPLAY not only resorts to distances in sparse trajectories to search for the informative hidden past states, but also accommodates the time-varying temporal regularities. Results show that REPLAY consistently and significantly outperforms state-of-the-art methods by 7.7%-10.5% in the location prediction task.
arXiv Detail & Related papers (2024-02-26T05:28:36Z)
- GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization [61.10806364001535]
Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth.
Existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task.
We propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations.
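Image-to-GPS retrieval of this kind reduces to nearest-neighbor search: embed the query image, embed a gallery of candidate GPS locations, and return the coordinates whose embeddings are most similar. A minimal pure-Python sketch follows; the function names and gallery format are illustrative assumptions, not GeoCLIP's actual API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def geo_retrieve(image_emb, gallery, k=1):
    """Rank a gallery of (gps_coord, gps_embedding) pairs by cosine
    similarity to the query image embedding; return the top-k coordinates."""
    ranked = sorted(gallery, key=lambda item: cosine(image_emb, item[1]),
                    reverse=True)
    return [coord for coord, _ in ranked[:k]]
```

Training then amounts to pulling each image embedding toward the embedding of its true GPS location in the shared space, so that this retrieval step recovers the correct coordinates at test time.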
arXiv Detail & Related papers (2023-09-27T20:54:56Z)
- Cross-View Visual Geo-Localization for Outdoor Augmented Reality [11.214903134756888]
We address the problem of geo-pose estimation by cross-view matching of query ground images to a geo-referenced aerial satellite image database.
We propose a new transformer neural network-based model and a modified triplet ranking loss for joint location and orientation estimation.
Experiments on several benchmark cross-view geo-localization datasets show that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-03-28T01:58:03Z)
- TempSAL -- Uncovering Temporal Information for Deep Saliency Prediction [64.63645677568384]
We introduce a novel saliency prediction model that learns to output saliency maps in sequential time intervals.
Our approach locally modulates the saliency predictions by combining the learned temporal maps.
Our code will be publicly available on GitHub.
arXiv Detail & Related papers (2023-01-05T22:10:16Z)
- Geo-Adaptive Deep Spatio-Temporal predictive modeling for human mobility [5.864710987890994]
Deep GA-vLS assumes data to be fixed, regular-shaped tensors and faces challenges in handling irregular data.
We present a novel geo-aware enabled learning operation based on a novel data structure for dependencies while maintaining the recurrent mechanism.
arXiv Detail & Related papers (2022-11-27T16:51:28Z)
- Cross-View Image Sequence Geo-localization [6.555961698070275]
Cross-view geo-localization aims to estimate the GPS location of a query ground-view image.
Recent approaches use panoramic ground-view images to increase the range of visibility.
We present the first cross-view geo-localization method that works on a sequence of limited Field-Of-View images.
arXiv Detail & Related papers (2022-10-25T19:46:18Z)
- Accurate 3-DoF Camera Geo-Localization via Ground-to-Satellite Image Matching [102.39635336450262]
We address the problem of ground-to-satellite image geo-localization by matching a query image captured at the ground level against a large-scale database with geotagged satellite images.
Our new method is able to achieve the fine-grained location of a query image, up to pixel size precision of the satellite image.
arXiv Detail & Related papers (2022-03-26T20:10:38Z)
- Geography-Aware Self-Supervised Learning [79.4009241781968]
We show that due to their different characteristics, a non-trivial gap persists between contrastive and supervised learning on standard benchmarks.
We propose novel training methods that exploit the spatially aligned structure of remote sensing data.
Our experiments show that our proposed method closes the gap between contrastive and supervised learning on image classification, object detection and semantic segmentation for remote sensing.
arXiv Detail & Related papers (2020-11-19T17:29:13Z)
- Reference Pose Generation for Long-term Visual Localization via Learned Features and View Synthesis [88.80710311624101]
We propose a semi-automated approach to generate reference poses based on feature matching between renderings of a 3D model and real images via learned features.
We significantly improve the nighttime reference poses of the popular Aachen Day-Night dataset, showing that state-of-the-art visual localization methods perform better (up to 47%) than predicted by the original reference poses.
arXiv Detail & Related papers (2020-05-11T15:13:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.