Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation
- URL: http://arxiv.org/abs/2308.11561v5
- Date: Thu, 14 Dec 2023 08:23:30 GMT
- Title: Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation
- Authors: Yifei Su, Dong An, Yuan Xu, Kehan Chen, Yan Huang
- Abstract summary: This report details the methods of the winning entry of the AVDN Challenge in ICCV CLVL 2023.
It addresses the Aerial Navigation from Dialog History (ANDH) task, which requires a drone agent to associate dialog history with aerial observations to reach the destination.
For better cross-modal grounding abilities of the drone agent, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework.
- Score: 10.25089706534778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report details the methods of the winning entry of the AVDN Challenge in
ICCV CLVL 2023. The competition addresses the Aerial Navigation from Dialog
History (ANDH) task, which requires a drone agent to associate dialog history
with aerial observations to reach the destination. For better cross-modal
grounding abilities of the drone agent, we propose a Target-Grounded
Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages
a graph-aware transformer to capture spatiotemporal dependency, which benefits
navigation state tracking and robust action planning. In addition, an auxiliary
visual grounding task is devised to boost the agent's awareness of referred
landmarks. Moreover, a hybrid augmentation strategy based on large language
models is utilized to mitigate data scarcity limitations. Our TG-GAT framework
won the AVDN Challenge, with 2.2% and 3.0% absolute improvements over the
baseline on SPL and SR metrics, respectively. The code is available at
https://github.com/yifeisu/TG-GAT.
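For readers curious what a "graph-aware transformer" layer might look like, the sketch below biases standard multi-head self-attention over navigation-graph nodes with a learned embedding of pairwise graph distances. It is a minimal illustration under assumed names, shapes, and biasing scheme, not the TG-GAT implementation; the authors' actual code lives in the repository linked above.

```python
# Hypothetical sketch of graph-aware self-attention over navigation-graph nodes.
# All module and parameter names are illustrative; they are NOT taken from TG-GAT.
import torch
import torch.nn as nn

class GraphAwareSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, max_dist: int = 10):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # One learned scalar bias per (graph distance, head) pair; distances are
        # clipped to max_dist so all longer hops share a single bucket.
        self.dist_bias = nn.Embedding(max_dist + 1, num_heads)

    def forward(self, x: torch.Tensor, graph_dist: torch.Tensor) -> torch.Tensor:
        # x:          (batch, num_nodes, dim) node features (visited/candidate viewpoints)
        # graph_dist: (batch, num_nodes, num_nodes) integer shortest-path hop counts
        b, n, _ = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        bias = self.dist_bias(graph_dist.clamp(max=self.dist_bias.num_embeddings - 1))
        attn = attn + bias.permute(0, 3, 1, 2)         # (b, heads, n, n)
        out = attn.softmax(dim=-1) @ v                 # (b, heads, n, head_dim)
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))

# Usage: nodes of the drone's navigation graph attend to one another with a
# distance-aware bias, one simple way to capture spatiotemporal dependency.
layer = GraphAwareSelfAttention(dim=256)
feats = torch.randn(2, 12, 256)
dists = torch.randint(0, 6, (2, 12, 12))
print(layer(feats, dists).shape)  # torch.Size([2, 12, 256])
```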
Related papers
- A Cross-Scene Benchmark for Open-World Drone Active Tracking [54.235808061746525]
Drone Visual Active Tracking aims to autonomously follow a target object by controlling the motion system based on visual observations.
We propose a unified cross-scene cross-domain benchmark for open-world drone active tracking called DAT.
We also propose a reinforcement learning-based drone tracking method called R-VAT.
arXiv Detail & Related papers (2024-12-01T09:37:46Z)
- GaussNav: Gaussian Splatting for Visual Navigation [92.13664084464514]
Instance ImageGoal Navigation (IIN) requires an agent to locate a specific object depicted in a goal image within an unexplored environment.
We propose a new framework for IIN, Gaussian Splatting for Visual Navigation (GaussNav), which constructs a novel map representation based on 3D Gaussian Splatting (3DGS).
Our GaussNav framework demonstrates a significant performance improvement, with Success weighted by Path Length (SPL) increasing from 0.347 to 0.578 on the challenging Habitat-Matterport 3D (HM3D) dataset.
arXiv Detail & Related papers (2024-03-18T09:56:48Z)
- Multi-model fusion for Aerial Vision and Dialog Navigation based on human attention aids [69.98258892165767]
We present a method for the aerial navigation from dialog history task of the 2023 ICCV CLVL challenge.
We propose an effective fusion training method for the Human Attention Aided Transformer (HAA-Transformer) and Human Attention Aided LSTM (HAA-LSTM) models.
arXiv Detail & Related papers (2023-08-27T10:32:52Z)
- A3D: Adaptive, Accurate, and Autonomous Navigation for Edge-Assisted Drones [12.439787085435661]
We propose A3D, an edge server assisted drone navigation framework.
A3D can reduce end-to-end latency by 28.06% and extend the flight distance by up to 27.28% compared with non-adaptive solutions.
arXiv Detail & Related papers (2023-07-19T10:23:28Z)
- SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking [12.447854608181833]
This work presents a novel saliency-guided dynamic vision Transformer (SGDViT) for UAV tracking.
The proposed method designs a new task-specific object saliency mining network to refine the cross-correlation operation.
A lightweight saliency filtering Transformer further refines saliency information and increases the focus on appearance information.
arXiv Detail & Related papers (2023-03-08T05:01:00Z)
- TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos [57.92385818430939]
Drone-to-drone detection using visual feed has crucial applications, such as detecting drone collisions, detecting drone attacks, or coordinating flight with other drones.
Existing methods are computationally costly, follow non-end-to-end optimization, and have complex multi-stage pipelines, making them less suitable for real-time deployment on edge devices.
We propose a simple yet effective framework, TransVisDrone, that provides an end-to-end solution with higher computational efficiency.
arXiv Detail & Related papers (2022-10-16T03:05:13Z)
- 1st Place Solutions for RxR-Habitat Vision-and-Language Navigation Competition (CVPR 2022) [28.5740809300599]
We present a modular plan-and-control approach for the problem of Vision-and-Language Navigation in Continuous Environments (VLN-CE).
Our model consists of three modules: the candidate waypoints predictor (CWP), the history enhanced planner, and the tryout controller.
Our model won the RxR-Habitat Competition 2022, with 48% and 90% relative improvements over existing methods on NDTW and SR metrics, respectively.
arXiv Detail & Related papers (2022-06-23T10:36:53Z)
- Dense Scene Multiple Object Tracking with Box-Plane Matching [73.54369833671772]
Multiple Object Tracking (MOT) is an important task in computer vision.
We propose the Box-Plane Matching (BPM) method to improve MOT performance in dense scenes.
With these three modules, our team achieved 1st place on the Track-1 leaderboard of the ACM MM Grand Challenge HiEve 2020.
arXiv Detail & Related papers (2020-07-30T16:39:22Z)
- Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in AI2-THOR, show that our model outperforms the baselines on both SR and SPL metrics.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
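Several entries above, like the TG-GAT abstract itself, report results in SR and SPL. As a reference point only, the snippet below computes both metrics from per-episode outcomes using their standard definitions (SR is the fraction of successful episodes; SPL additionally weights each success by the ratio of shortest-path length to the path actually flown). This is a generic sketch, not evaluation code from any paper listed here.

```python
# Generic sketch of the standard SR and SPL navigation metrics;
# not taken from any of the papers listed above.
from dataclasses import dataclass

@dataclass
class Episode:
    success: bool          # did the agent stop within the success radius?
    shortest_path: float   # geodesic distance from start to goal
    agent_path: float      # length of the path the agent actually took

def success_rate(episodes: list[Episode]) -> float:
    # SR = fraction of episodes that ended in success
    return sum(e.success for e in episodes) / len(episodes)

def spl(episodes: list[Episode]) -> float:
    # SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)
    return sum(
        e.success * e.shortest_path / max(e.agent_path, e.shortest_path)
        for e in episodes
    ) / len(episodes)

episodes = [
    Episode(True, 20.0, 25.0),   # success, but 25% longer than optimal
    Episode(True, 15.0, 15.0),   # success along the shortest path
    Episode(False, 30.0, 40.0),  # failure contributes 0 to both numerators
]
print(f"SR  = {success_rate(episodes):.3f}")   # 0.667
print(f"SPL = {spl(episodes):.3f}")            # (0.8 + 1.0 + 0.0) / 3 = 0.600
```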