Related papers: MSSPlace: Multi-Sensor Place Recognition with Visual and Text Semantics

MSSPlace: Multi-Sensor Place Recognition with Visual and Text Semantics

URL: http://arxiv.org/abs/2407.15663v1
Date: Mon, 22 Jul 2024 14:24:56 GMT
Title: MSSPlace: Multi-Sensor Place Recognition with Visual and Text Semantics
Authors: Alexander Melekhin, Dmitry Yudin, Ilia Petryashin, Vitaly Bezuglyj,
Abstract summary: We study the impact of leveraging a multi-camera setup and integrating diverse data sources for multimodal place recognition. Our proposed method named MSSPlace utilizes images from multiple cameras, LiDAR point clouds, semantic segmentation masks, and text annotations to generate comprehensive place descriptors.
Score: 41.94295877935867
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Place recognition is a challenging task in computer vision, crucial for enabling autonomous vehicles and robots to navigate previously visited environments. While significant progress has been made in learnable multimodal methods that combine onboard camera images and LiDAR point clouds, the full potential of these methods remains largely unexplored in localization applications. In this paper, we study the impact of leveraging a multi-camera setup and integrating diverse data sources for multimodal place recognition, incorporating explicit visual semantics and text descriptions. Our proposed method named MSSPlace utilizes images from multiple cameras, LiDAR point clouds, semantic segmentation masks, and text annotations to generate comprehensive place descriptors. We employ a late fusion approach to integrate these modalities, providing a unified representation. Through extensive experiments on the Oxford RobotCar and NCLT datasets, we systematically analyze the impact of each data source on the overall quality of place descriptors. Our experiments demonstrate that combining data from multiple sensors significantly improves place recognition model performance compared to single modality approaches and leads to state-of-the-art quality. We also show that separate usage of visual or textual semantics (which are more compact representations of sensory data) can achieve promising results in place recognition. The code for our method is publicly available: https://github.com/alexmelekhin/MSSPlace

Related papers

MGCR-Net:Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data.<n> Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z)
MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion [2.7745600113170994]
We introduce the MultiSensor-Home dataset, a novel benchmark for comprehensive action recognition in home environments.<n>We also propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method.
arXiv Detail & Related papers (2025-04-03T05:23:08Z)
Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing [19.344890308208555]
We propose a new method to enhance vision-language datasets for remote sensing by integrating maps as external data sources. We introduce fMoW-mm, a multimodal dataset incorporating satellite imagery, maps, metadata, and text annotations.
arXiv Detail & Related papers (2025-01-24T20:13:29Z)
GSPR: Multimodal Place Recognition Using 3D Gaussian Splatting for Autonomous Driving [9.023864430027333]
multimodal place recognition has gained increasing attention due to their ability to overcome weaknesses of uni sensor systems. We propose a 3D Gaussian-based multimodal place recognition neural network dubbed GSPR.
arXiv Detail & Related papers (2024-10-01T00:43:45Z)
ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes. ARMADA is a novel multimodal data generation framework that: (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation. This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition [71.90536979421093]
We propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of Vision-Language Models (VLMs) We develop an in-context learning approach to associate the inherent knowledge from LLMs. Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually.
arXiv Detail & Related papers (2024-07-30T15:58:25Z)
AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the problem with more semantics. We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings. We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features. Our approach achieves favorable performance against existing methods in literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z)
MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images [7.0622873873577054]
We propose a novel metadata-collaborative segmentation network (MetaSegNet) for semantic segmentation of remote sensing images. Unlike the common model structure that only uses unimodal visual data, we extract the key characteristic from freely available remote sensing image metadata. We construct an image encoder, a text encoder, and a crossmodal attention fusion subnetwork to extract the image and text feature.
arXiv Detail & Related papers (2023-12-20T03:16:34Z)
LCPR: A Multi-Scale Attention-Based LiDAR-Camera Fusion Network for Place Recognition [11.206532393178385]
We present a novel neural network named LCPR for robust multimodal place recognition. Our method can effectively utilize multi-view camera and LiDAR data to improve the place recognition performance.
arXiv Detail & Related papers (2023-11-06T15:39:48Z)
CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding. CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches. With the simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset.
arXiv Detail & Related papers (2023-09-28T18:04:43Z)
Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification [11.562980171753162]
Generalizable person re-identification (Re-ID) is a very hot research topic in machine learning and computer vision. Previous methods mainly focus on the visual representation learning, while neglect to explore the potential of semantic features during training. We propose a Multi-Modal Equivalent Transformer called MMET for more robust visual-semantic embedding learning.
arXiv Detail & Related papers (2023-04-19T08:37:25Z)
Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition [131.6328804788164]
We propose a framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in vision-sensor modality (videos) The SAKDN uses multiple wearable-sensors as teacher modalities and uses RGB videos as student modality.
arXiv Detail & Related papers (2020-09-01T03:38:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.