Utilizing Multimodal Data for Edge Case Robust Call-sign Recognition and Understanding
- URL: http://arxiv.org/abs/2412.20467v1
- Date: Sun, 29 Dec 2024 13:45:11 GMT
- Title: Utilizing Multimodal Data for Edge Case Robust Call-sign Recognition and Understanding
- Authors: Alexander Blatt, Dietrich Klakow
- Abstract summary: The robustness of an architecture is particularly evident in edge cases.
We propose the multimodal call-sign-command recovery model (CCR).
The CCR architecture leads to an increase in edge-case performance of up to 15%.
- Score: 65.55175502273013
- License:
- Abstract: Operational machine-learning based assistant systems must be robust in a wide range of scenarios. This holds especially true for the air-traffic control (ATC) domain. The robustness of an architecture is particularly evident in edge cases, such as high word error rate (WER) transcripts resulting from noisy ATC recordings or partial transcripts due to clipped recordings. To increase the edge-case robustness of call-sign recognition and understanding (CRU), a core task in ATC speech processing, we propose the multimodal call-sign-command recovery model (CCR). The CCR architecture leads to an increase in edge-case performance of up to 15%. We demonstrate this on our second proposed architecture, CallSBERT, a CRU model that has fewer parameters, can be fine-tuned noticeably faster, and is more robust during fine-tuning than the state of the art for CRU. Furthermore, we demonstrate that optimizing for edge cases leads to significantly higher accuracy across a wide operational range.
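The edge cases discussed in the abstract are characterized by high word error rate (WER). As a reminder of how that metric is defined, here is a minimal illustrative sketch (not code from the paper): WER is the word-level edit distance between a reference transcript and an ASR hypothesis, normalized by the reference length. The sample utterance is invented for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words, divided by
    the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the word-level edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# An ATC-style utterance with one substitution ("two" -> "three")
# and one deletion ("flight"): 2 errors over 10 reference words.
print(wer("lufthansa one two three climb flight level three two zero",
          "lufthansa one three three climb level three two zero"))  # → 0.2
```

A noisy recording can push this ratio far higher, which is exactly the regime the CCR architecture targets.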
Related papers
- Towards Generalizable Trajectory Prediction Using Dual-Level Representation Learning And Adaptive Prompting [107.4034346788744]
Existing vehicle trajectory prediction models struggle with generalizability, prediction uncertainties, and handling complex interactions.
We propose Perceiver with Register queries (PerReg+), a novel trajectory prediction framework that introduces: (1) Dual-Level Representation Learning via Self-Distillation (SD) and Masked Reconstruction (MR), capturing global context and fine-grained details; (2) Enhanced Multimodality using register-based queries and pretraining, eliminating the need for clustering and suppression; and (3) Adaptive Prompt Tuning during fine-tuning, freezing the main architecture and optimizing a small number of prompts for efficient adaptation.
arXiv Detail & Related papers (2025-01-08T20:11:09Z)
- Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control [60.35553925189286]
We propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture.
We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets.
arXiv Detail & Related papers (2024-06-19T21:11:01Z)
- Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping [27.547461769425855]
Per-core clipping (PCC) can effectively mitigate unintended memorization in ASR models.
PCC positively influences ASR performance metrics, leading to improved convergence rates and reduced word error rates.
arXiv Detail & Related papers (2024-06-04T06:34:33Z)
- A One-Layer Decoder-Only Transformer is a Two-Layer RNN: With an Application to Certified Robustness [17.0639534812572]
ARC-Tran is a novel approach for verifying the robustness of decoder-only Transformers against arbitrary perturbation spaces.
Our evaluation shows that ARC-Tran trains models more robust to arbitrary perturbation spaces than those produced by existing techniques.
arXiv Detail & Related papers (2024-05-27T17:10:04Z)
- Anatomy of Industrial Scale Multilingual ASR [13.491861238522421]
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system.
Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages.
arXiv Detail & Related papers (2024-04-15T14:48:43Z)
- TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation.
We propose an effective temporal multi-scale (TMS) model where multi-scale branches could be efficiently designed in a speaker embedding network almost without increasing computational costs.
arXiv Detail & Related papers (2022-03-17T05:49:35Z)
- CATRO: Channel Pruning via Class-Aware Trace Ratio Optimization [61.71504948770445]
We propose a novel channel pruning method via Class-Aware Trace Ratio Optimization (CATRO) to reduce the computational burden and accelerate the model inference.
We show that CATRO achieves higher accuracy with similar cost or lower cost with similar accuracy than other state-of-the-art channel pruning algorithms.
Because of its class-aware property, CATRO is suitable for adaptively pruning efficient networks for various classification subtasks, facilitating the deployment and use of deep networks in real-world applications.
arXiv Detail & Related papers (2021-10-21T06:26:31Z)
- SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks.
Specifically, our analysis shows that SRU++ can surpass Conformer by a large margin on long-form speech input.
arXiv Detail & Related papers (2021-10-11T19:23:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.