Text2Graph VPR: A Text-to-Graph Expert System for Explainable Place Recognition in Changing Environments
- URL: http://arxiv.org/abs/2512.18613v1
- Date: Sun, 21 Dec 2025 06:16:20 GMT
- Title: Text2Graph VPR: A Text-to-Graph Expert System for Explainable Place Recognition in Changing Environments
- Authors: Saeideh Yousefzadeh, Hamidreza Pourreza
- Abstract summary: Text2Graph VPR converts image sequences into textual scene descriptions. Scene graphs capture objects, attributes and pairwise relations. We demonstrate robust retrieval under severe appearance shifts.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Place Recognition (VPR) in long-term deployment requires reasoning beyond pixel similarity: systems must make transparent, interpretable decisions that remain robust under lighting, weather and seasonal change. We present Text2Graph VPR, an explainable semantic localization system that converts image sequences into textual scene descriptions, parses those descriptions into structured scene graphs, and reasons over the resulting graphs to identify places. Scene graphs capture objects, attributes and pairwise relations; we aggregate per-frame graphs into a compact place representation and perform retrieval with a dual-similarity mechanism that fuses learned Graph Attention Network (GAT) embeddings and a Shortest-Path (SP) kernel for structural matching. This hybrid design enables both learned semantic matching and topology-aware comparison, and -- critically -- produces human-readable intermediate representations that support diagnostic analysis and improve transparency in the decision process. We validate the system on Oxford RobotCar and MSLS (Amman/San Francisco) benchmarks and demonstrate robust retrieval under severe appearance shifts, along with zero-shot operation using human textual queries. The results illustrate that semantic, graph-based reasoning is a viable and interpretable alternative for place recognition, particularly suited to safety-sensitive and resource-constrained settings.
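The dual-similarity retrieval described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the graph encoding (a dict of node labels plus an undirected edge list), the function names, the weighted-sum fusion rule and its weight `alpha` are all hypothetical, and the GAT embeddings are taken as precomputed vectors rather than learned here. The shortest-path kernel counts matching (label, label, path length) triples between two scene graphs.

```python
from collections import Counter, deque
from math import sqrt


def sp_features(labels, edges):
    """Count (label, label, shortest-path length) triples in an undirected graph."""
    adj = {n: set() for n in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    feats = Counter()
    for src in labels:
        # Breadth-first search gives unweighted shortest-path lengths from src.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for dst, d in dist.items():
            if src < dst:  # count each unordered node pair once
                a, b = sorted((labels[src], labels[dst]))
                feats[(a, b, d)] += 1
    return feats


def sp_kernel(g1, g2):
    """Unnormalized shortest-path kernel: dot product of triple counts."""
    f1, f2 = sp_features(*g1), sp_features(*g2)
    return sum(c * f2[k] for k, c in f1.items())


def normalized_sp_kernel(g1, g2):
    """Kernel scaled to [0, 1] so it can be mixed with a cosine score."""
    denom = sqrt(sp_kernel(g1, g1) * sp_kernel(g2, g2))
    return sp_kernel(g1, g2) / denom if denom else 0.0


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx, ny = sqrt(sum(a * a for a in x)), sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def fused_similarity(emb1, emb2, g1, g2, alpha=0.5):
    """Hypothetical fusion: weighted sum of embedding and structural scores."""
    return alpha * cosine(emb1, emb2) + (1 - alpha) * normalized_sp_kernel(g1, g2)


# Toy scene graphs: (node id -> object label, undirected edges).
summer = ({"n1": "building", "n2": "tree", "n3": "road"}, [("n1", "n3"), ("n2", "n3")])
winter = ({"n1": "building", "n2": "tree", "n3": "road"}, [("n1", "n3"), ("n2", "n3")])
other = ({"n1": "bridge", "n2": "river"}, [("n1", "n2")])

same_place = fused_similarity([1.0, 0.0], [1.0, 0.0], summer, winter)
diff_place = fused_similarity([1.0, 0.0], [0.0, 1.0], summer, other)
```

In this toy setup the two graphs of the same place score higher than the unrelated pair, because both the embedding term and the structural term agree. A real system would obtain the embeddings from a trained GAT and would likely tune the fusion on validation data.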
Related papers
- SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning [53.638998508418545]
This paper introduces a new task, "Image Collaborative Segmentation and Captioning" (SegCaptioning). SegCaptioning aims to translate a straightforward prompt, like a bounding box around an object, into diverse semantic interpretations represented by (caption, masks) pairs. This task poses significant challenges, including accurately capturing a user's intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks.
arXiv Detail & Related papers (2025-12-01T18:33:04Z)
- Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation [120.23172120151821]
We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models. We introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences. We propose a new metric, Visual Semantic Matching, that quantifies visual inconsistencies in subject-driven image generation.
arXiv Detail & Related papers (2025-09-26T07:11:55Z)
- Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z)
- SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval [1.51422963961219]
We present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches.
arXiv Detail & Related papers (2025-05-21T11:56:09Z)
- Vision Graph Prompting via Semantic Low-Rank Decomposition [10.223578525761617]
Vision GNN (ViG) demonstrates superior performance by representing images as graph structures. To efficiently adapt ViG to downstream tasks, parameter-efficient fine-tuning techniques like visual prompting become increasingly essential. We propose Vision Graph Prompting (VGP), a novel framework tailored for vision graph structures.
arXiv Detail & Related papers (2025-05-07T04:29:29Z)
- A Graph-Based Framework for Interpretable Whole Slide Image Analysis [86.37618055724441]
We develop a framework that transforms whole-slide images into biologically-informed graph representations. Our approach builds graph nodes from tissue regions that respect natural structures, not arbitrary grids. We demonstrate strong performance on challenging cancer staging and survival prediction tasks.
arXiv Detail & Related papers (2025-03-14T20:15:04Z)
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing [66.70054075041487]
Existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z)
- Closing the Loop: Graph Networks to Unify Semantic Objects and Visual Features for Multi-object Scenes [2.236663830879273]
Loop Closure Detection (LCD) is essential to minimize drift when recognizing previously visited places.
Visual Bag-of-Words (vBoW) has been an LCD algorithm of choice for many state-of-the-art SLAM systems.
This paper proposes SymbioLCD2, which creates a unified graph structure to integrate semantic objects and visual features symbiotically.
arXiv Detail & Related papers (2022-09-24T00:42:33Z)
- Scene Graph Embeddings Using Relative Similarity Supervision [4.137464623395376]
We employ a graph convolutional network to exploit structure in scene graphs and produce image embeddings useful for semantic image retrieval.
We propose a novel loss function that operates on pairs of similar and dissimilar images and imposes relative ordering between them in embedding space.
We demonstrate that this Ranking loss, coupled with an intuitive triple sampling strategy, leads to robust representations that outperform well-known contrastive losses on the retrieval task.
arXiv Detail & Related papers (2021-04-06T09:13:05Z)
- GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.