Domain-Agnostic Causal-Aware Audio Transformer for Infant Cry Classification
- URL: http://arxiv.org/abs/2512.16271v1
- Date: Thu, 18 Dec 2025 07:40:44 GMT
- Title: Domain-Agnostic Causal-Aware Audio Transformer for Infant Cry Classification
- Authors: Geofrey Owino, Bernard Shibwabo Kasamani, Ahmed M. Abdelmoniem, Edem Wornyo,
- Abstract summary: We propose DACH-TIC, a Domain-Agnostic Causal-Aware Hierarchical Audio Transformer for robust infant cry classification. The model integrates causal attention, hierarchical representation learning, multi-task supervision, and adversarial domain generalization within a unified framework. The model generalizes effectively to unseen acoustic environments, with a domain performance gap of only 2.4 percent.
- Score: 5.764453198495989
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate and interpretable classification of infant cry paralinguistics is essential for early detection of neonatal distress and clinical decision support. However, many existing deep learning methods rely on correlation-driven acoustic representations, which makes them vulnerable to noise, spurious cues, and domain shifts across recording environments. We propose DACH-TIC, a Domain-Agnostic Causal-Aware Hierarchical Audio Transformer for robust infant cry classification. The model integrates causal attention, hierarchical representation learning, multi-task supervision, and adversarial domain generalization within a unified framework. DACH-TIC employs a structured transformer backbone with local token-level and global semantic encoders, augmented by causal attention masking and controlled perturbation training to approximate counterfactual acoustic variations. A domain-adversarial objective promotes environment-invariant representations, while multi-task learning jointly optimizes cry type recognition, distress intensity estimation, and causal relevance prediction. The model is evaluated on the Baby Chillanto and Donate-a-Cry datasets, with ESC-50 environmental noise overlays for domain augmentation. Experimental results show that DACH-TIC outperforms state-of-the-art baselines, including HTS-AT and SE-ResNet Transformer, achieving improvements of 2.6 percent in accuracy and 2.2 points in macro-F1 score, alongside enhanced causal fidelity. The model generalizes effectively to unseen acoustic environments, with a domain performance gap of only 2.4 percent, demonstrating its suitability for real-world neonatal acoustic monitoring systems.
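Two of the mechanisms named in the abstract, causal attention masking and a multi-task objective with an adversarial domain term, can be illustrated with a minimal, framework-free sketch. This is an illustrative reconstruction, not the authors' implementation: the function names, the loss weights, and the sign-flipped domain term (a stand-in for the gradient-reversal trick commonly used in domain-adversarial training) are all assumptions.

```python
import numpy as np

def causal_attention_weights(scores):
    """Row-wise softmax over attention scores with a causal mask:
    each time frame may attend only to itself and earlier frames."""
    T = scores.shape[-1]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # strictly-upper triangle
    masked = np.where(future, -1e9, scores)             # suppress future frames
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multitask_objective(l_cry, l_intensity, l_causal, l_domain,
                        w=(1.0, 0.5, 0.5, 0.2)):
    """Joint objective over the three supervised tasks, plus a domain
    term with flipped sign, mimicking gradient reversal so the encoder
    is pushed toward domain-invariant features (weights are illustrative)."""
    return w[0] * l_cry + w[1] * l_intensity + w[2] * l_causal - w[3] * l_domain

# With uniform scores, each frame spreads attention evenly over the
# positions the causal mask permits (itself and all earlier frames).
w_attn = causal_attention_weights(np.zeros((4, 4)))
```

With zero scores, the first row attends only to position 0 and the last row splits attention uniformly over all four positions, so every row still sums to 1 despite the masking.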
Related papers
- Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement [24.109107195976346]
URSA-GAN is a domain-aware generative framework designed to mitigate mismatches in both noise and channel conditions. We show that URSA-GAN effectively reduces character error rates in ASR and improves metrics in SE across diverse noisy and mismatched channel scenarios.
arXiv Detail & Related papers (2026-02-04T08:16:22Z)
- Test-time Adaptive Hierarchical Co-enhanced Denoising Network for Reliable Multimodal Classification [55.56234913868664]
We propose Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD) for reliable learning on multimodal data. The proposed method achieves superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.
arXiv Detail & Related papers (2026-01-12T03:14:12Z)
- Explainable Transformer-CNN Fusion for Noise-Robust Speech Emotion Recognition [2.0391237204597363]
Speech Emotion Recognition systems often degrade in performance when exposed to unpredictable acoustic interference. We propose a Hybrid Transformer-CNN framework that unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of 1D-Convolutional Neural Networks.
arXiv Detail & Related papers (2025-12-20T10:05:58Z)
- Ivan-ISTD: Rethinking Cross-domain Heteroscedastic Noise Perturbations in Infrared Small Target Detection [53.689841037081834]
Ivan-ISTD is designed to address the dual challenges of cross-domain shift and heteroscedastic noise perturbations in ISTD. Ivan-ISTD demonstrates excellent robustness in cross-domain scenarios.
arXiv Detail & Related papers (2025-10-14T07:48:31Z)
- Ecologically Valid Benchmarking and Adaptive Attention: Scalable Marine Bioacoustic Monitoring [2.558238597112103]
GetNetUPAM is a nested cross-validation framework to model stability under realistic variability. Data are partitioned into distinct site-year segments, preserving recording conditions and ensuring each validation fold reflects a unique environmental subset. ARPA-N achieves a 14.4% gain in average precision over DenseNet baselines and a log2-scale order-of-magnitude drop in variability across all metrics.
arXiv Detail & Related papers (2025-09-04T22:03:05Z)
- ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signals [8.411477071838592]
We propose a novel foundation model ECHO that integrates an advanced band-split architecture with frequency positional embeddings. We evaluate our method on various kinds of machine signal datasets.
arXiv Detail & Related papers (2025-08-20T13:10:44Z)
- Wavelet-Guided Dual-Frequency Encoding for Remote Sensing Change Detection [67.84730634802204]
Change detection in remote sensing imagery plays a vital role in various engineering applications, such as natural disaster monitoring, urban expansion tracking, and infrastructure management. Most existing methods still rely on spatial-domain modeling, where the limited diversity of feature representations hinders the detection of subtle change regions. We observe that frequency-domain feature modeling, particularly in the wavelet domain, amplifies fine-grained differences in frequency components, enhancing the perception of edge changes that are challenging to capture in the spatial domain.
arXiv Detail & Related papers (2025-08-07T11:14:16Z)
- ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning [57.67273340380651]
Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks. These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
arXiv Detail & Related papers (2025-07-03T14:29:43Z)
- Adaptive Control Attention Network for Underwater Acoustic Localization and Domain Adaptation [8.017203108408973]
Localizing acoustic sound sources in the ocean is a challenging task due to the complex and dynamic nature of the environment. We propose a multi-branch network architecture designed to accurately predict the distance between a moving acoustic source and a receiver. Our proposed method outperforms state-of-the-art (SOTA) approaches in similar settings.
arXiv Detail & Related papers (2025-06-20T18:13:30Z)
- $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR). MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
- Decoupled Doubly Contrastive Learning for Cross Domain Facial Action Unit Detection [66.80386429324196]
We propose a decoupled doubly contrastive adaptation (D$2$CA) approach to learn a purified AU representation. D$2$CA is trained to disentangle AU and domain factors by assessing the quality of synthesized faces. It consistently outperforms state-of-the-art cross-domain AU detection approaches.
arXiv Detail & Related papers (2025-03-12T00:42:17Z)
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
- Glottal source estimation robustness: A comparison of sensitivity of voice source estimation techniques [11.97036509133719]
This paper addresses the problem of estimating the voice source directly from speech waveforms.
A novel principle based on Anticausality Dominated Regions (ACDR) is used to estimate the glottal open phase.
arXiv Detail & Related papers (2020-05-24T08:13:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.