Utilizing Multimodal Data for Edge Case Robust Call-sign Recognition and Understanding
- URL: http://arxiv.org/abs/2412.20467v1
- Date: Sun, 29 Dec 2024 13:45:11 GMT
- Title: Utilizing Multimodal Data for Edge Case Robust Call-sign Recognition and Understanding
- Authors: Alexander Blatt, Dietrich Klakow
- Abstract summary: The robustness of an architecture is particularly evident in edge cases. We propose the multimodal call-sign-command recovery model (CCR). The CCR architecture leads to an increase in edge-case performance of up to 15%.
- Score: 65.55175502273013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Operational machine-learning based assistant systems must be robust in a wide range of scenarios. This holds especially true for the air-traffic control (ATC) domain. The robustness of an architecture is particularly evident in edge cases, such as high word error rate (WER) transcripts resulting from noisy ATC recordings or partial transcripts due to clipped recordings. To increase the edge-case robustness of call-sign recognition and understanding (CRU), a core task in ATC speech processing, we propose the multimodal call-sign-command recovery model (CCR). The CCR architecture leads to an increase in edge-case performance of up to 15%. We demonstrate this on our second proposed architecture, CallSBERT, a CRU model that has fewer parameters, can be fine-tuned noticeably faster, and is more robust during fine-tuning than the state of the art for CRU. Furthermore, we demonstrate that optimizing for edge cases leads to significantly higher accuracy across a wide operational range.
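As a rough illustration of the multimodal recovery idea, the sketch below scores call-sign candidates against a possibly clipped or noisy transcript with a small sentence encoder, assuming (as is common in ATC CRU systems) that the second modality is a list of call-sign candidates derived from surveillance data. The encoder, names, and shapes are hypothetical stand-ins in the spirit of CallSBERT, not the authors' CCR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTextEncoder(nn.Module):
    """Hypothetical stand-in for a sentence encoder such as SBERT."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):                  # (batch, seq)
        pooled = self.emb(token_ids).mean(dim=1)   # mean-pool over tokens
        return F.normalize(self.proj(pooled), dim=-1)

def rank_callsign_candidates(encoder, transcript_ids, candidate_ids):
    """Score surveillance-derived call-sign candidates against a transcript."""
    t = encoder(transcript_ids)                    # (1, dim)
    c = encoder(candidate_ids)                     # (n_candidates, dim)
    return (c @ t.T).squeeze(-1)                   # cosine similarity per candidate

encoder = TinyTextEncoder()
transcript = torch.randint(0, 1000, (1, 12))       # noisy/clipped transcript tokens
candidates = torch.randint(0, 1000, (5, 4))        # candidate call-signs (2nd modality)
scores = rank_callsign_candidates(encoder, transcript, candidates)
print("best candidate index:", scores.argmax().item())
```

Grounding recognition in a closed candidate set is what helps in edge cases: even a badly garbled transcript only has to be matched against a handful of plausible call-signs rather than decoded from scratch.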
Related papers
- On the Practice of Deep Hierarchical Ensemble Network for Ad Conversion Rate Prediction [14.649184507551436]
We propose a multitask learning framework with DHEN as the single backbone model architecture to predict all CVR tasks.
We build both on-site real-time user behavior sequences and off-site conversion event sequences for CVR prediction purposes.
Our method achieves state-of-the-art performance compared to previous single feature crossing modules with pre-trained user personalization features.
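A minimal sketch of a multitask setup of this kind, assuming a single shared backbone (standing in for DHEN) feeding one prediction head per CVR task; all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Hypothetical stand-in for the DHEN backbone."""
    def __init__(self, in_dim=32, hid=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(),
                                 nn.Linear(hid, hid), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class MultiTaskCVR(nn.Module):
    """One shared representation, one sigmoid head per CVR task."""
    def __init__(self, tasks, in_dim=32, hid=64):
        super().__init__()
        self.backbone = SharedBackbone(in_dim, hid)
        self.heads = nn.ModuleDict({t: nn.Linear(hid, 1) for t in tasks})

    def forward(self, x):
        h = self.backbone(x)
        return {t: torch.sigmoid(head(h)) for t, head in self.heads.items()}

model = MultiTaskCVR(tasks=["click_to_convert", "view_to_convert"])
preds = model(torch.randn(8, 32))   # one CVR estimate per task per example
```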
arXiv Detail & Related papers (2025-04-10T23:41:34Z)
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing the mathematical reasoning and coding performance of large language models (LLMs).
We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education.
We utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications.
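A toy illustration of the soft-reward idea: rather than a brittle exact-match check, a generative judge's confidence is used directly as a graded reward. The judge log-probability below is a placeholder value, not the paper's scoring model.

```python
import math

def binary_reward(answer: str, reference: str) -> float:
    """Brittle baseline: exact-match verification, all-or-nothing."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0

def soft_reward(judge_logprob_correct: float) -> float:
    """Map a generative judge's log-probability that the answer is
    correct to a bounded reward in [0, 1]."""
    return math.exp(judge_logprob_correct)

# A partially correct free-form answer scores zero under exact match
# but receives a graded signal under the model-based scheme.
print(binary_reward("photosynthesis in the chloroplasts", "photosynthesis"))  # 0.0
print(soft_reward(math.log(0.82)))                                            # ~0.82
```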
arXiv Detail & Related papers (2025-03-31T08:22:49Z)
- Towards Generalizable Trajectory Prediction Using Dual-Level Representation Learning And Adaptive Prompting [107.4034346788744]
Existing vehicle trajectory prediction models struggle with generalizability, prediction uncertainties, and handling complex interactions.
We propose Perceiver with Register queries (PerReg+), a novel trajectory prediction framework that introduces: (1) Dual-Level Representation Learning via Self-Distillation (SD) and Masked Reconstruction (MR), capturing global context and fine-grained details; (2) Enhanced Multimodality using register-based queries and pretraining, eliminating the need for clustering and suppression; and (3) Adaptive Prompt Tuning during fine-tuning, freezing the main architecture and optimizing a small number of prompts for efficient adaptation.
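A minimal sketch of the adaptive prompt tuning step described in (3): the backbone is frozen and only a small set of prompt embeddings, prepended to the input, is optimized. This is a generic prompt-tuning recipe under those assumptions, not the PerReg+ code.

```python
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Hypothetical stand-in for the pretrained trajectory backbone."""
    def __init__(self, dim=64):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x):
        return self.layer(x)

dim, n_prompts, batch = 64, 8, 4
backbone = FrozenBackbone(dim)
for p in backbone.parameters():
    p.requires_grad_(False)                 # main architecture stays frozen

prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
opt = torch.optim.Adam([prompts], lr=1e-3)  # only the prompts are trained

x = torch.randn(batch, 10, dim)             # (batch, tokens, dim)
inp = torch.cat([prompts.expand(batch, -1, -1), x], dim=1)  # prepend prompts
loss = backbone(inp).pow(2).mean()          # stand-in for the real objective
loss.backward()
opt.step()
```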
arXiv Detail & Related papers (2025-01-08T20:11:09Z)
- Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control [60.35553925189286]
We propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture.
We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets.
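One common way to realize such a joint system is to serialize speaker-role tags into the ASR target sequence, so a single seq2seq model learns both tasks at once. The sketch below shows only that serialization convention, as an illustration; the paper's exact output format may differ.

```python
def serialize(turns):
    """turns: list of (role, text) pairs, role in {'ATCO', 'PILOT'}."""
    return " ".join(f"<{role}> {text}" for role, text in turns)

def parse(joint_hypothesis):
    """Recover (role, text) pairs from a role-tagged ASR hypothesis."""
    out, role, words = [], None, []
    for tok in joint_hypothesis.split():
        if tok in ("<ATCO>", "<PILOT>"):
            if role is not None:
                out.append((role, " ".join(words)))
            role, words = tok.strip("<>"), []
        else:
            words.append(tok)
    if role is not None:
        out.append((role, " ".join(words)))
    return out

target = serialize([("ATCO", "lufthansa four two descend"),
                    ("PILOT", "descending lufthansa four two")])
print(target)   # <ATCO> lufthansa four two descend <PILOT> descending ...
print(parse(target))
```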
arXiv Detail & Related papers (2024-06-19T21:11:01Z)
- Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping [27.547461769425855]
Per-core clipping (PCC) can effectively mitigate unintended memorization in ASR models.
PCC positively influences ASR performance metrics, leading to improved convergence rates and reduced word error rates.
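A simplified sketch of the per-core clipping idea: each core's gradient is norm-clipped before the per-core gradients are averaged, bounding any single shard's influence on the update. Cores are emulated here as sequential shards; the paper's clipping granularity and training stack may differ.

```python
import torch
import torch.nn as nn

def per_core_clipped_grads(model, loss_fn, shards, clip_norm=1.0):
    """Return the average of per-shard gradients, each clipped to clip_norm."""
    avg = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in shards:                       # one shard stands in for one core
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        total = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()
        scale = min(1.0, clip_norm / (total + 1e-12))   # clip this core's grad
        for a, g in zip(avg, grads):
            a += scale * g / len(shards)      # then average across cores
    return avg

model = nn.Linear(16, 4)
shards = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(4)]
grads = per_core_clipped_grads(model, nn.CrossEntropyLoss(), shards)
```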
arXiv Detail & Related papers (2024-06-04T06:34:33Z)
- A One-Layer Decoder-Only Transformer is a Two-Layer RNN: With an Application to Certified Robustness [17.0639534812572]
ARC-Tran is a novel approach for verifying the robustness of decoder-only Transformers against arbitrary perturbation spaces.
Our evaluation shows that ARC-Tran trains models more robust to arbitrary perturbation spaces than those produced by existing techniques.
arXiv Detail & Related papers (2024-05-27T17:10:04Z)
- Anatomy of Industrial Scale Multilingual ASR [13.491861238522421]
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system.
Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages.
arXiv Detail & Related papers (2024-04-15T14:48:43Z)
- Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset Selection [59.77647907277523]
Adversarial contrastive learning (ACL) does not require expensive data annotations but outputs a robust representation that withstands adversarial attacks.
However, ACL requires tremendous running time to generate the adversarial variants of all training data.
This paper proposes a robustness-aware coreset selection (RCS) method to speed up ACL.
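A simplified sketch of robustness-aware selection, using the loss under a one-step (FGSM-style) perturbation as the selection score and keeping only the highest-scoring fraction for adversarial training. RCS's actual criterion is more refined, so treat this as a proxy illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def robustness_scores(model, xs, ys, eps=0.03):
    """Loss under a one-step adversarial probe: high loss = least robust."""
    xs = xs.clone().requires_grad_(True)
    loss = F.cross_entropy(model(xs), ys, reduction="sum")
    grad, = torch.autograd.grad(loss, xs)
    x_adv = xs + eps * grad.sign()            # FGSM-like perturbation
    with torch.no_grad():
        return F.cross_entropy(model(x_adv), ys, reduction="none")

def select_coreset(model, xs, ys, frac=0.25):
    """Keep the fraction of examples the model is currently least robust on."""
    scores = robustness_scores(model, xs, ys)
    k = max(1, int(frac * len(xs)))
    return torch.topk(scores, k).indices

model = nn.Linear(32, 10)
xs, ys = torch.randn(100, 32), torch.randint(0, 10, (100,))
coreset_idx = select_coreset(model, xs, ys)   # adversarial training uses only these
```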
arXiv Detail & Related papers (2023-02-08T03:20:14Z)
- TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation.
We propose an effective temporal multi-scale (TMS) model where multi-scale branches could be efficiently designed in a speaker embedding network almost without increasing computational costs.
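One plausible way to build such a block is sketched below: parallel depthwise convolutions with different dilations cover several time scales at little extra cost, and a pointwise convolution fuses the branches. The exact branch design in TMS may differ.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Hypothetical temporal multi-scale block for speaker embeddings."""
    def __init__(self, channels=64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, padding=d,
                      dilation=d, groups=channels)   # depthwise: very cheap
            for d in dilations
        )
        self.fuse = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                            # x: (batch, channels, time)
        return self.fuse(sum(branch(x) for branch in self.branches))

block = MultiScaleBlock()
out = block(torch.randn(2, 64, 200))                 # shape preserved: (2, 64, 200)
```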
arXiv Detail & Related papers (2022-03-17T05:49:35Z)
- CATRO: Channel Pruning via Class-Aware Trace Ratio Optimization [61.71504948770445]
We propose a novel channel pruning method via Class-Aware Trace Ratio Optimization (CATRO) to reduce the computational burden and accelerate the model inference.
We show that CATRO achieves higher accuracy at similar computational cost, or comparable accuracy at lower cost, than other state-of-the-art channel pruning algorithms.
Because of its class-aware property, CATRO is suitable for adaptively pruning efficient networks for various classification subtasks, facilitating the practical deployment and use of deep networks in real-world applications.
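For intuition, the sketch below computes a class-aware trace-ratio style score per channel from pooled activations: channels whose activations separate the classes well (large between-class scatter relative to within-class scatter) are kept. CATRO's actual objective and optimization are more involved.

```python
import torch

def channel_scores(acts, labels):
    """acts: (n_samples, n_channels) pooled activations; labels: (n_samples,)."""
    mu = acts.mean(dim=0)                     # global mean per channel
    s_b = torch.zeros(acts.shape[1])          # between-class scatter
    s_w = torch.zeros(acts.shape[1])          # within-class scatter
    for c in labels.unique():
        xc = acts[labels == c]
        mu_c = xc.mean(dim=0)
        s_b += len(xc) * (mu_c - mu) ** 2
        s_w += ((xc - mu_c) ** 2).sum(dim=0)
    return s_b / (s_w + 1e-8)                 # trace-ratio style score

acts = torch.randn(200, 32)                   # e.g. globally average-pooled features
labels = torch.randint(0, 5, (200,))
keep = torch.topk(channel_scores(acts, labels), k=16).indices  # prune the rest
```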
arXiv Detail & Related papers (2021-10-21T06:26:31Z)
- SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ to ASR tasks by comparing it with Conformer across multiple ASR benchmarks.
Specifically, our analysis shows that SRU++ can surpass Conformer on long-form speech input by a large margin.
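For intuition about why the recurrence is fast, here is a heavily simplified SRU-style cell: the matrix products are batched over all time steps at once, and the per-step recurrence is purely elementwise. The reset/highway gate and the attention component that SRU++ adds are omitted.

```python
import torch
import torch.nn as nn

class SimpleSRUCell(nn.Module):
    """Simplified SRU-style recurrence (not the full SRU++ layer)."""
    def __init__(self, dim=64):
        super().__init__()
        self.w = nn.Linear(dim, 2 * dim)      # candidate + forget gate, all steps

    def forward(self, x):                     # x: (time, batch, dim)
        cand, f_in = self.w(x).chunk(2, dim=-1)   # batched over time: the fast part
        f = torch.sigmoid(f_in)
        c = torch.zeros_like(x[0])
        outs = []
        for t in range(x.shape[0]):           # elementwise-only recurrence
            c = f[t] * c + (1 - f[t]) * cand[t]
            outs.append(c)
        return torch.stack(outs)

cell = SimpleSRUCell()
h = cell(torch.randn(50, 2, 64))              # (time, batch, dim)
```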
arXiv Detail & Related papers (2021-10-11T19:23:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.