Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation
- URL: http://arxiv.org/abs/2308.11561v5
- Date: Thu, 14 Dec 2023 08:23:30 GMT
- Title: Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation
- Authors: Yifei Su, Dong An, Yuan Xu, Kehan Chen, Yan Huang
- Abstract summary: This report details the methods of the winning entry of the AVDN Challenge in ICCV CLVL 2023.
It addresses the Aerial Navigation from Dialog History (ANDH) task, which requires a drone agent to associate dialog history with aerial observations to reach the destination.
For better cross-modal grounding abilities of the drone agent, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework.
- Score: 10.25089706534778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report details the methods of the winning entry of the AVDN Challenge in
ICCV CLVL 2023. The competition addresses the Aerial Navigation from Dialog
History (ANDH) task, which requires a drone agent to associate dialog history
with aerial observations to reach the destination. For better cross-modal
grounding abilities of the drone agent, we propose a Target-Grounded
Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages
a graph-aware transformer to capture spatiotemporal dependency, which benefits
navigation state tracking and robust action planning. In addition, an auxiliary
visual grounding task is devised to boost the agent's awareness of referred
landmarks. Moreover, a hybrid augmentation strategy based on large language
models is utilized to mitigate data scarcity limitations. Our TG-GAT framework
won the AVDN Challenge, with 2.2% and 3.0% absolute improvements over the
baseline on SPL and SR metrics, respectively. The code is available at
https://github.com/yifeisu/TG-GAT.
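For readers curious what a "graph-aware transformer" layer might look like, the sketch below biases standard multi-head self-attention over navigation-graph nodes with a learned embedding of pairwise graph distances. It is a minimal illustration under assumed names, shapes, and biasing scheme, not the TG-GAT implementation; the authors' actual code lives in the repository linked above.

```python
# Hypothetical sketch of graph-aware self-attention over navigation-graph nodes.
# All module and parameter names are illustrative; they are NOT taken from TG-GAT.
import torch
import torch.nn as nn

class GraphAwareSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, max_dist: int = 10):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # One learned scalar bias per (graph distance, head) pair; distances are
        # clipped to max_dist so all longer hops share a single bucket.
        self.dist_bias = nn.Embedding(max_dist + 1, num_heads)

    def forward(self, x: torch.Tensor, graph_dist: torch.Tensor) -> torch.Tensor:
        # x:          (batch, num_nodes, dim) node features (visited/candidate viewpoints)
        # graph_dist: (batch, num_nodes, num_nodes) integer shortest-path hop counts
        b, n, _ = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        bias = self.dist_bias(graph_dist.clamp(max=self.dist_bias.num_embeddings - 1))
        attn = attn + bias.permute(0, 3, 1, 2)         # (b, heads, n, n)
        out = attn.softmax(dim=-1) @ v                 # (b, heads, n, head_dim)
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))

# Usage: nodes of the drone's navigation graph attend to one another with a
# distance-aware bias, one simple way to capture spatiotemporal dependency.
layer = GraphAwareSelfAttention(dim=256)
feats = torch.randn(2, 12, 256)
dists = torch.randint(0, 6, (2, 12, 12))
print(layer(feats, dists).shape)  # torch.Size([2, 12, 256])
```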
Related papers
- A Cross-Scene Benchmark for Open-World Drone Active Tracking [54.235808061746525]
Drone Visual Active Tracking aims to autonomously follow a target object by controlling the motion system based on visual observations.
We propose a unified cross-scene cross-domain benchmark for open-world drone active tracking called DAT.
We also propose a reinforcement learning-based drone tracking method called R-VAT.
arXiv Detail & Related papers (2024-12-01T09:37:46Z)
- GaussNav: Gaussian Splatting for Visual Navigation [92.13664084464514]
Instance ImageGoal Navigation (IIN) requires an agent to locate a specific object depicted in a goal image within an unexplored environment.
We propose a new framework for IIN, Gaussian Splatting for Visual Navigation (GaussNav), which constructs a novel map representation based on 3D Gaussian Splatting (3DGS).
Our GaussNav framework demonstrates a significant performance improvement, with Success weighted by Path Length (SPL) increasing from 0.347 to 0.578 on the challenging Habitat-Matterport 3D (HM3D) dataset.
arXiv Detail & Related papers (2024-03-18T09:56:48Z)
- Multi-model fusion for Aerial Vision and Dialog Navigation based on human attention aids [69.98258892165767]
We present a method for the aerial navigation from dialog history task of the 2023 ICCV CLVL challenge.
We propose an effective fusion training method for the Human Attention Aided Transformer (HAA-Transformer) and Human Attention Aided LSTM (HAA-LSTM) models.
arXiv Detail & Related papers (2023-08-27T10:32:52Z)
- A3D: Adaptive, Accurate, and Autonomous Navigation for Edge-Assisted Drones [12.439787085435661]
We propose A3D, an edge server assisted drone navigation framework.
A3D can reduce end-to-end latency by 28.06% and extend the flight distance by up to 27.28% compared with non-adaptive solutions.
arXiv Detail & Related papers (2023-07-19T10:23:28Z)
- SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking [12.447854608181833]
This work presents a novel saliency-guided dynamic vision Transformer (SGDViT) for UAV tracking.
The proposed method designs a new task-specific object saliency mining network to refine the cross-correlation operation.
A lightweight saliency filtering Transformer further refines saliency information and increases the focus on appearance information.
arXiv Detail & Related papers (2023-03-08T05:01:00Z)
- TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos [57.92385818430939]
Drone-to-drone detection using visual feed has crucial applications, such as detecting drone collisions, detecting drone attacks, or coordinating flight with other drones.
Existing methods are computationally costly, follow non-end-to-end optimization, and have complex multi-stage pipelines, making them less suitable for real-time deployment on edge devices.
We propose a simple yet effective framework, TransVisDrone, that provides an end-to-end solution with higher computational efficiency.
arXiv Detail & Related papers (2022-10-16T03:05:13Z)
- 1st Place Solutions for RxR-Habitat Vision-and-Language Navigation Competition (CVPR 2022) [28.5740809300599]
We present a modular plan-and-control approach for the problem of Vision-and-Language Navigation in Continuous Environments (VLN-CE).
Our model consists of three modules: the candidate waypoints predictor (CWP), the history enhanced planner, and the tryout controller.
Our model won the RxR-Habitat Competition 2022, with 48% and 90% relative improvements over existing methods on NDTW and SR metrics, respectively.
arXiv Detail & Related papers (2022-06-23T10:36:53Z)
- Dense Scene Multiple Object Tracking with Box-Plane Matching [73.54369833671772]
Multiple Object Tracking (MOT) is an important task in computer vision.
We propose the Box-Plane Matching (BPM) method to improve MOT performance in dense scenes.
With these three modules, our team achieved 1st place on the Track-1 leaderboard of the ACM MM Grand Challenge HiEve 2020.
arXiv Detail & Related papers (2020-07-30T16:39:22Z)
- Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in AI2-THOR, show that our model outperforms the baselines on both SR and SPL metrics.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
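Several entries above, like the TG-GAT abstract itself, report results in SR and SPL. As a reference point only, the snippet below computes both metrics from per-episode outcomes using their standard definitions (SR is the fraction of successful episodes; SPL additionally weights each success by the ratio of shortest-path length to the path actually flown). This is a generic sketch, not evaluation code from any paper listed here.

```python
# Generic sketch of the standard SR and SPL navigation metrics;
# not taken from any of the papers listed above.
from dataclasses import dataclass

@dataclass
class Episode:
    success: bool          # did the agent stop within the success radius?
    shortest_path: float   # geodesic distance from start to goal
    agent_path: float      # length of the path the agent actually took

def success_rate(episodes: list[Episode]) -> float:
    # SR = fraction of episodes that ended in success
    return sum(e.success for e in episodes) / len(episodes)

def spl(episodes: list[Episode]) -> float:
    # SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)
    return sum(
        e.success * e.shortest_path / max(e.agent_path, e.shortest_path)
        for e in episodes
    ) / len(episodes)

episodes = [
    Episode(True, 20.0, 25.0),   # success, but 25% longer than optimal
    Episode(True, 15.0, 15.0),   # success along the shortest path
    Episode(False, 30.0, 40.0),  # failure contributes 0 to both numerators
]
print(f"SR  = {success_rate(episodes):.3f}")   # 0.667
print(f"SPL = {spl(episodes):.3f}")            # (0.8 + 1.0 + 0.0) / 3 = 0.600
```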