LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition
- URL: http://arxiv.org/abs/2407.06730v1
- Date: Tue, 9 Jul 2024 10:15:31 GMT
- Title: LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition
- Authors: Teng Wang, Lingquan Meng, Lei Cheng, Changyin Sun,
- Abstract summary: We try to build a discriminative global representations by fusing image data and text descriptions of the the visual scene.
The motivation is twofold: (1) Current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible manner in generating text descriptions of images.
Although promising, leveraging LVLMs to build multi-modal VPR solutions remains challenging in efficient multi-modal fusion.
- Score: 17.388776062997813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual place recognition (VPR) remains challenging due to significant viewpoint changes and appearance variations. Mainstream works tackle these challenges by developing various feature aggregation methods to transform deep features into robust and compact global representations. Unfortunately, satisfactory results cannot be achieved under challenging conditions. We start from a new perspective and attempt to build a discriminative global representations by fusing image data and text descriptions of the the visual scene. The motivation is twofold: (1) Current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible manner in generating text descriptions of images; (2) The text descriptions, which provide high-level scene understanding, show strong robustness against environment variations. Although promising, leveraging LVLMs to build multi-modal VPR solutions remains challenging in efficient multi-modal fusion. Furthermore, LVLMs will inevitably produces some inaccurate descriptions, making it even harder. To tackle these challenges, we propose a novel multi-modal VPR solution. It first adapts pre-trained visual and language foundation models to VPR for extracting image and text features, which are then fed into the feature combiner to enhance each other. As the main component, the feature combiner first propose a token-wise attention block to adaptively recalibrate text tokens according to their relevance to the image data, and then develop an efficient cross-attention fusion module to propagate information across different modalities. The enhanced multi-modal features are compressed into the feature descriptor for performing retrieval. Experimental results show that our method outperforms state-of-the-art methods by a large margin with significantly smaller image descriptor dimension.
Related papers
- FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder.
Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
arXiv Detail & Related papers (2024-11-21T14:22:38Z) - ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average accuracy of 0.77% on ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z) - AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions.
We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images.
Our model is capable of processing images with resolutions up to $1008times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z) - Progressive Multi-modal Conditional Prompt Tuning [92.50645776024624]
Pre-trained vision-language models (VLMs) have shown remarkable generalization capabilities via prompting.
We propose a novel method, Progressive Multi-modal conditional Prompt Tuning (ProMPT)
ProMPT exploits a recurrent structure, optimizing and aligning V-L features by iteratively utilizing image and current encoding information.
arXiv Detail & Related papers (2024-04-18T02:40:31Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training.
VLMAE employs visual generative learning, facilitating the model to acquire fine-grained and unbiased features.
arXiv Detail & Related papers (2022-08-19T14:39:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.