Related papers: Multimodal Deep Learning for ATCO Command Lifecycle Modeling and Workload Prediction

URL: http://arxiv.org/abs/2509.10522v1
Date: Thu, 04 Sep 2025 02:28:41 GMT
Title: Multimodal Deep Learning for ATCO Command Lifecycle Modeling and Workload Prediction
Authors: Kaizhen Tan
Abstract summary: This paper proposes a multimodal deep learning framework to estimate two key parameters in the ATCO command lifecycle. A CNN-Transformer ensemble model was developed for accurate, generalizable, and interpretable predictions. By linking trajectories to voice commands, this work offers the first model of its kind to support intelligent command generation.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Air traffic controllers (ATCOs) issue high-intensity voice commands in dense airspace, where accurate workload modeling is critical for safety and efficiency. This paper proposes a multimodal deep learning framework that integrates structured data, trajectory sequences, and image features to estimate two key parameters in the ATCO command lifecycle: the time offset between a command and the resulting aircraft maneuver, and the command duration. A high-quality dataset was constructed, with maneuver points detected using sliding window and histogram-based methods. A CNN-Transformer ensemble model was developed for accurate, generalizable, and interpretable predictions. By linking trajectories to voice commands, this work offers the first model of its kind to support intelligent command generation and provides practical value for workload assessment, staffing, and scheduling.
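The abstract mentions sliding-window maneuver detection without giving details. As a rough illustration only, the sliding-window idea can be sketched as below. The function name, the window size, the threshold, and the choice of a single scalar signal (for example altitude) are all assumptions made for this sketch, not the paper's method, and the histogram-based step is not covered.

```python
from typing import Optional, Sequence


def detect_maneuver_point(
    values: Sequence[float],
    window: int = 3,
    threshold: float = 40.0,
) -> Optional[int]:
    """Hypothetical sliding-window detector (not the paper's algorithm).

    Compares the mean of the `window` samples before index i with the
    mean of the `window` samples starting at i, and returns the first
    index where the absolute difference exceeds `threshold`.
    Returns None if no such index exists.
    """
    for i in range(window, len(values) - window + 1):
        before = sum(values[i - window:i]) / window
        after = sum(values[i:i + window]) / window
        if abs(after - before) > threshold:
            return i
    return None


# Example: a trajectory signal that is flat, then steps to a new level.
signal = [0.0] * 5 + [50.0] * 5
print(detect_maneuver_point(signal))
```

A larger window smooths out noise but blurs the detected point; the values here are placeholders that would need tuning on real trajectory data.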

Related papers

Hybrid Distillation with CoT Guidance for Edge-Drone Control Code Generation [18.74352644644387]
This paper proposes an integrated approach that combines knowledge distillation, chain-of-thought guidance, and supervised fine-tuning for UAV multi-SDK control tasks. Experimental results indicate that the distilled lightweight model maintains high code generation accuracy while achieving significant improvements in deployment and inference efficiency.
arXiv Detail & Related papers (2026-01-13T10:31:09Z)
Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities. Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark. We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z)
Detect Anything via Next Point Prediction [51.55967987350882]
Rex-Omni is a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models.
arXiv Detail & Related papers (2025-10-14T17:59:54Z)
Enhancing Training Data Attribution with Representational Optimization [57.61977909113113]
Training data attribution methods aim to measure how training data impacts a model's predictions. We propose AirRep, a representation-based approach that closes this gap by learning task-specific and model-aligned representations explicitly for TDA. AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence.
arXiv Detail & Related papers (2025-05-24T05:17:53Z)
SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs. We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z)
GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models [17.488420164181463]
This paper introduces a sophisticated encoder-decoder framework to address visual grounding in autonomous vehicles (AVs). Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders (Text, Image, Context, and Cross-Modal) with a Multimodal decoder.
arXiv Detail & Related papers (2023-12-06T15:14:30Z)
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders. Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs. We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets. We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z)
Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory [64.11870454160614]
We propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM). ADA-CM has two operating modes. The first mode makes it tunable without learning new parameters in a training-free paradigm. Our proposed method achieves competitive results with state-of-the-art on the HICO-DET and V-COCO datasets with much less training time.
arXiv Detail & Related papers (2023-09-07T13:10:06Z)
ENTL: Embodied Navigation Trajectory Learner [37.43079415330256]
We propose a method for extracting long sequence representations for embodied navigation. We train our model using vector-quantized predictions of future states conditioned on current actions. A key property of our approach is that the model is pre-trained without any explicit reward signal.
arXiv Detail & Related papers (2023-04-05T17:58:33Z)
Fully End-to-end Autonomous Driving with Semantic Depth Cloud Mapping and Multi-Agent [2.512827436728378]
We propose a novel deep learning model, trained in an end-to-end and multi-task manner, that performs perception and control tasks simultaneously. The model is evaluated on the CARLA simulator across scenarios that combine normal and adversarial situations with different weather conditions to mimic the real world.
arXiv Detail & Related papers (2022-04-12T03:57:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.