A Dual Semantic-Aware Recurrent Global-Adaptive Network For
Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2305.03602v2
- Date: Tue, 30 May 2023 02:33:12 GMT
- Title: A Dual Semantic-Aware Recurrent Global-Adaptive Network For
Vision-and-Language Navigation
- Authors: Liuyi Wang, Zongtao He, Jiagui Tang, Ronghao Dang, Naijia Wang,
Chengju Liu, Qijun Chen
- Abstract summary: Vision-and-Language Navigation (VLN) is a realistic but challenging task that requires an agent to locate the target region using verbal and visual cues.
This work proposes a dual semantic-aware recurrent global-adaptive network (DSRG) that addresses under-explored semantic mining and the limitations of existing structured-map memories.
- Score: 3.809880620207714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Navigation (VLN) is a realistic but challenging task that
requires an agent to locate the target region using verbal and visual cues.
While significant advances have been achieved recently, two broad limitations remain:
(1) explicit mining of the significant guiding semantics concealed in both vision and
language is still under-explored; (2) previous structured-map methods represent visited
nodes by the average of their historical appearance, ignoring the distinctive
contributions of individual images and the retention of potent information during
reasoning. This work proposes a dual semantic-aware recurrent global-adaptive network
(DSRG) to address these problems. First, DSRG introduces an instruction-guidance
linguistic module (IGL) and an appearance-semantics visual module (ASV) to boost
language and vision semantic learning, respectively. For the memory mechanism, a global
adaptive aggregation module (GAA) is devised for explicit panoramic observation fusion,
and a recurrent memory fusion module (RMF) is introduced to supply implicit temporal
hidden states. Extensive experiments on the R2R and REVERIE datasets demonstrate that
our method achieves better performance than existing methods. Code is available at
https://github.com/CrystalSixone/DSRG.
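
The memory design described in the abstract can be pictured with a short sketch. The code below is a minimal PyTorch illustration of the general idea only: panoramic view features of a node are fused with learned attention weights rather than a uniform average (in the spirit of the GAA module), and a GRU cell carries an implicit temporal hidden state across steps (in the spirit of the RMF module). All class and parameter names here are hypothetical and the details are assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Minimal sketch (assumptions, not the authors' implementation):
# - adaptive fusion of panoramic view features via learned attention weights,
#   instead of uniformly averaging the views of a visited node (GAA-like idea);
# - a GRU cell carrying an implicit temporal hidden state across steps (RMF-like idea).
import torch
import torch.nn as nn


class AdaptivePanoramaMemory(nn.Module):
    def __init__(self, feat_dim: int = 768, hidden_dim: int = 512):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)          # per-view importance logit
        self.proj = nn.Linear(feat_dim, hidden_dim)  # project fused observation
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, views: torch.Tensor, hidden: torch.Tensor):
        # views:  (batch, num_views, feat_dim) panoramic image features at the current node
        # hidden: (batch, hidden_dim) recurrent memory carried over from previous steps
        weights = torch.softmax(self.score(views), dim=1)  # (batch, num_views, 1)
        fused = (weights * views).sum(dim=1)               # weighted fusion, not a plain mean
        hidden = self.rnn(self.proj(fused), hidden)        # update implicit temporal state
        return fused, hidden


if __name__ == "__main__":
    mem = AdaptivePanoramaMemory()
    h = torch.zeros(2, 512)
    for _ in range(3):                   # a short navigation rollout
        pano = torch.randn(2, 36, 768)   # e.g., 36 discretized panoramic views
        fused, h = mem(pano, h)
    print(fused.shape, h.shape)          # torch.Size([2, 768]) torch.Size([2, 512])
```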
Related papers
- Cog-GA: A Large Language Models-based Generative Agent for Vision-Language Navigation in Continuous Environments [19.818370526976974] (arXiv, 2024-09-04)
  Vision Language Navigation in Continuous Environments (VLN-CE) represents a frontier in embodied AI.
  We introduce Cog-GA, a generative agent founded on large language models (LLMs) tailored for VLN-CE tasks.
  Cog-GA employs a dual-pronged strategy to emulate human-like cognitive processes.
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715] (arXiv, 2024-05-21)
  We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
  GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
  Our method achieves performance comparable to the state of the art while being nearly 220 times faster in terms of computational cost.
- KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation [61.08389704326803] (arXiv, 2023-03-28)
  Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location by following a natural language instruction in real scenes.
  Most previous approaches represent navigable candidates with entire view features or object-centric features.
  We propose a Knowledge Enhanced Reasoning Model (KERM) that leverages knowledge to improve the agent's navigation ability.
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478] (arXiv, 2023-01-17)
  Image-Text Retrieval (ITR) aims at searching for target instances that are semantically relevant to a given query from the other modality.
  Existing approaches typically suffer from two major limitations.
- Learning Granularity-Unified Representations for Text-to-Image Person Re-identification [29.04254233799353] (arXiv, 2022-07-16)
  Text-to-image person re-identification (ReID) aims to search for pedestrian images of an identity of interest via textual descriptions.
  Existing works usually ignore the difference in feature granularity between the two modalities.
  We propose an end-to-end framework based on transformers to learn granularity-unified representations for both modalities, denoted as LGUR.
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019] (arXiv, 2022-02-23)
  We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
  We build a topological map on the fly to enable efficient exploration in the global action space (a minimal sketch of this idea appears after this list).
  The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
- Structured Scene Memory for Vision-Language Navigation [155.63025602722712] (arXiv, 2021-03-05)
  We propose a structured scene memory architecture for vision-language navigation (VLN) that is compartmentalized enough to accurately memorize percepts during navigation.
  It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
- Language Guided Networks for Cross-modal Moment Retrieval [66.49445903955777] (arXiv, 2020-06-18)
  Cross-modal moment retrieval aims to localize a temporal segment from an untrimmed video described by a natural language query.
  Existing methods independently extract the features of videos and sentences.
  We present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval.
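
The DUET entry above mentions building a topological map on the fly to support a global action space. The following is a minimal sketch of that general idea under my own assumptions (names such as TopologicalMap, update, and frontier are hypothetical, and networkx stands in for whatever graph handling DUET actually uses); it is not the paper's implementation.

```python
# Minimal sketch (assumptions, not DUET's implementation): visited viewpoints and
# their observed-but-unvisited neighbors become graph nodes, so an agent can pick
# actions in a global action space (any frontier node), not only local neighbors.
import networkx as nx


class TopologicalMap:
    def __init__(self):
        self.graph = nx.Graph()
        self.visited = set()

    def update(self, current_id: str, neighbor_ids: list) -> None:
        """Register the current viewpoint and edges to its navigable neighbors."""
        self.visited.add(current_id)
        self.graph.add_node(current_id)
        for nid in neighbor_ids:
            self.graph.add_edge(current_id, nid)

    def frontier(self) -> list:
        """Observed but not-yet-visited nodes: the global action space."""
        return [n for n in self.graph.nodes if n not in self.visited]

    def plan_path(self, current_id: str, goal_id: str) -> list:
        """Shortest path over the map to reach a chosen frontier node."""
        return nx.shortest_path(self.graph, current_id, goal_id)


# Example rollout: the agent visits "a", observes "b" and "c", then moves to "b".
m = TopologicalMap()
m.update("a", ["b", "c"])
m.update("b", ["a", "d"])
print(m.frontier())           # ['c', 'd']
print(m.plan_path("b", "c"))  # ['b', 'a', 'c']
```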