Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation
- URL: http://arxiv.org/abs/2308.11561v5
- Date: Thu, 14 Dec 2023 08:23:30 GMT
- Title: Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation
- Authors: Yifei Su, Dong An, Yuan Xu, Kehan Chen, Yan Huang
- Abstract summary: This report details the methods of the winning entry of the AVDN Challenge in ICCV CLVL 2023.
It addresses the Aerial Navigation from Dialog History (ANDH) task, which requires a drone agent to associate dialog history with aerial observations to reach the destination.
For better cross-modal grounding abilities of the drone agent, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework.
- Score: 10.25089706534778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report details the methods of the winning entry of the AVDN Challenge in
ICCV CLVL 2023. The competition addresses the Aerial Navigation from Dialog
History (ANDH) task, which requires a drone agent to associate dialog history
with aerial observations to reach the destination. For better cross-modal
grounding abilities of the drone agent, we propose a Target-Grounded
Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages
a graph-aware transformer to capture spatiotemporal dependency, which benefits
navigation state tracking and robust action planning. In addition, an auxiliary
visual grounding task is devised to boost the agent's awareness of referred
landmarks. Moreover, a hybrid augmentation strategy based on large language
models is utilized to mitigate data scarcity limitations. Our TG-GAT framework
won the AVDN Challenge, with 2.2% and 3.0% absolute improvements over the
baseline on SPL and SR metrics, respectively. The code is available at
https://github.com/yifeisu/TG-GAT.
Related papers
- SFTrack: A Robust Scale and Motion Adaptive Algorithm for Tracking Small and Fast Moving Objects [2.9803250365852443]
This paper addresses the problem of multi-object tracking in Unmanned Aerial Vehicle (UAV) footage.
It plays a critical role in various UAV applications, including traffic monitoring systems and real-time suspect tracking by the police.
We propose a new tracking strategy, which initiates the tracking of target objects from low-confidence detections.
arXiv Detail & Related papers (2024-10-26T05:09:20Z) - GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation [65.71524410114797]
GOAT-Bench is a benchmark for the universal navigation task GO to AnyThing (GOAT)
In GOAT, the agent is directed to navigate to a sequence of targets specified by the category name, language description, or image.
We benchmark monolithic RL and modular methods on the GOAT task, analyzing their performance across modalities.
arXiv Detail & Related papers (2024-04-09T20:40:00Z) - Multi-model fusion for Aerial Vision and Dialog Navigation based on
human attention aids [69.98258892165767]
We present an aerial navigation task for the 2023 ICCV Conversation History.
We propose an effective method of fusion training of Human Attention Aided Transformer model (HAA-Transformer) and Human Attention Aided LSTM (HAA-LSTM) models.
arXiv Detail & Related papers (2023-08-27T10:32:52Z) - A3D: Adaptive, Accurate, and Autonomous Navigation for Edge-Assisted
Drones [12.439787085435661]
We propose A3D, an edge server assisted drone navigation framework.
A3D can reduce end-to-end latency by 28.06% and extend the flight distance by up to 27.28% compared with non-adaptive solutions.
arXiv Detail & Related papers (2023-07-19T10:23:28Z) - SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking [12.447854608181833]
This work presents a novel saliency-guided dynamic vision Transformer (SGDViT) for UAV tracking.
The proposed method designs a new task-specific object saliency mining network to refine the cross-correlation operation.
A lightweight saliency filtering Transformer further refines the saliency information and sharpens the focus on appearance cues; a schematic sketch of saliency-weighted correlation follows this entry.
arXiv Detail & Related papers (2023-03-08T05:01:00Z) - TransVisDrone: Spatio-Temporal Transformer for Vision-based
- TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos [57.92385818430939]
Drone-to-drone detection using visual feed has crucial applications, such as detecting drone collisions, detecting drone attacks, or coordinating flight with other drones.
Existing methods are computationally costly, follow non-end-to-end optimization, and have complex multi-stage pipelines, making them less suitable for real-time deployment on edge devices.
We propose a simple yet effective framework, TransVisDrone, that provides an end-to-end solution with higher computational efficiency.
arXiv Detail & Related papers (2022-10-16T03:05:13Z) - 1st Place Solutions for RxR-Habitat Vision-and-Language Navigation
Competition (CVPR 2022) [28.5740809300599]
We present a modular plan-and-control approach for the problem of Vision-and-Language Navigation in Continuous Environments (VLN-CE)
Our model consists of three modules: the candidate waypoints predictor (CWP), the history enhanced planner and the tryout controller.
Our model won the RxR-Habitat Competition 2022, with 48% and 90% relative improvements over existing methods on NDTW and SR metrics respectively.
arXiv Detail & Related papers (2022-06-23T10:36:53Z) - An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z) - Dense Scene Multiple Object Tracking with Box-Plane Matching [73.54369833671772]
Multiple Object Tracking (MOT) is an important task in computer vision.
We propose the Box-Plane Matching (BPM) method to improve MOT performance in dense scenes.
Thanks to the effectiveness of its three modules, our team achieved 1st place on the Track-1 leaderboard of the ACM MM Grand Challenge HiEve 2020.
arXiv Detail & Related papers (2020-07-30T16:39:22Z) - Improving Target-driven Visual Navigation with Attention on 3D Spatial
Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments in the AI2-THOR environment show that our model outperforms the baselines on both SR and SPL metrics.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and accepts no responsibility for any consequences arising from its use.