GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot
Attention for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2305.17102v2
- Date: Mon, 2 Oct 2023 16:23:03 GMT
- Title: GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot
Attention for Vision-and-Language Navigation
- Authors: Jingyang Huo, Qiang Sun, Boyan Jiang, Haitao Lin, Yanwei Fu
- Abstract summary: We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Vision-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
- Score: 52.65506307440127
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing works solving the Room-to-Room VLN problem only utilize RGB
images and do not consider the local context around candidate views, which lacks
sufficient visual cues about the surrounding environment. Moreover, natural language
contains complex semantic information, so its correlations with visual inputs are
hard to model merely with cross attention. In this paper, we propose GeoVLN, which
learns a geometry-enhanced visual representation based on slot attention for robust
Vision-and-Language Navigation. The RGB images are complemented with the
corresponding depth maps and normal maps predicted by Omnidata as visual inputs.
Technically, we introduce a two-stage module that combines local slot attention and
the CLIP model to produce a geometry-enhanced representation from such input. We
employ V&L BERT to learn a cross-modal representation that incorporates both
language and vision information. Additionally, a novel multiway attention module is
designed, encouraging different phrases of the input instruction to exploit the most
relevant features from the visual input. Extensive experiments demonstrate the
effectiveness of our newly designed modules and show the compelling performance of
the proposed method.
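The abstract names slot attention as the mechanism that aggregates local visual context (RGB features plus Omnidata-predicted depth and normal features) into a geometry-enhanced representation. As a rough illustration of that mechanism only, the sketch below implements standard slot attention (Locatello et al., 2020) in PyTorch; the module names, feature dimensions, and the concatenation-based fusion of modalities are assumptions for illustration, not the authors' implementation.

```python
# Minimal slot attention sketch (Locatello et al., 2020) illustrating how local
# per-modality features around a candidate view could be grouped into slots.
# All names and shapes here are illustrative, not GeoVLN's released code.
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, num_slots=4, dim=256, iters=3, hidden_dim=512):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, dim))
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, inputs):                      # inputs: (B, N, dim) local view features
        B, N, D = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        mu = self.slots_mu.expand(B, self.num_slots, -1)
        sigma = self.slots_logsigma.exp().expand(B, self.num_slots, -1)
        slots = mu + sigma * torch.randn_like(mu)   # sample initial slots
        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)  # slots compete per input
            attn = attn / attn.sum(dim=-1, keepdim=True)                # weighted mean over inputs
            updates = attn @ v                                          # (B, num_slots, dim)
            slots = self.gru(updates.reshape(-1, D), slots_prev.reshape(-1, D)).reshape(B, -1, D)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots

# Hypothetical usage: fuse per-modality patch features around one candidate view.
# rgb_feat, depth_feat, normal_feat: (B, N_patches, 256), e.g. from CLIP / Omnidata branches.
#   local_feats = torch.cat([rgb_feat, depth_feat, normal_feat], dim=1)
#   geometry_enhanced = SlotAttention()(local_feats)   # (B, num_slots, 256)
```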
Related papers
- EarthMarker: Visual Prompt Learning for Region-level and Point-level Remote Sensing Imagery Comprehension [12.9701635989222]
The first visual prompting model named EarthMarker is proposed, which excels in image-level, region-level, and point-level RS imagery interpretation.
To endow the EarthMarker with versatile multi-granularity visual perception abilities, the cross-domain phased learning strategy is developed.
To tackle the lack of RS visual prompting data, a dataset named RSVP featuring multi-modal fine-grained visual prompting instruction is constructed.
arXiv Detail & Related papers (2024-07-18T15:35:00Z)
- ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization [0.0]
We propose a two-stage training method to enhance visual performance and use contrastive learning to mine challenging samples.
We validate the effectiveness of the proposed strategy on several large-scale visual geo-localization datasets.
arXiv Detail & Related papers (2024-06-04T02:28:51Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts [38.59120110371588]
We introduce a novel multimodal model capable of decoding arbitrary visual prompts.
This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow".
Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings.
arXiv Detail & Related papers (2023-12-01T18:59:56Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- Cross-modal Map Learning for Vision and Language Navigation [82.04247028482244]
We consider the problem of Vision-and-Language Navigation (VLN).
In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations.
We propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions.
arXiv Detail & Related papers (2022-03-10T03:30:12Z)
- Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation [120.90387630691816]
Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
arXiv Detail & Related papers (2021-04-09T02:44:39Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)