Related papers: Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4

Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4

URL: http://arxiv.org/abs/2506.21174v1
Date: Thu, 26 Jun 2025 12:27:52 GMT
Title: Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4
Authors: Jongyeon Park, Joonhee Lee, Do-Hyeon Lim, Hong Kook Kim, Hyeongcheol Geum, Jeong Eun Lim,
Abstract summary: This report presents submission systems for Task 4 of the DCASE 2025 Challenge.<n>It incorporates additional audio features into the embedding feature extracted from the mel-spectral feature.<n>Second, an agent-based label correction system is applied to the outputs processed by the S5 system.
Score: 2.68085089595424
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. This model incorporates additional audio features (spectral roll-off and chroma features) into the embedding feature extracted from the mel-spectral feature to im-prove the classification capabilities of an audio-tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is motivated by the fact that mixed audio often contains subtle cues that are difficult to capture with mel-spectrograms alone. Thus, these additional features offer alterna-tive perspectives for the model. Second, an agent-based label correction system is applied to the outputs processed by the S5 system. This system reduces false positives, improving the final class-aware signal-to-distortion ratio improvement (CA-SDRi) metric. Finally, we refine the training dataset to enhance the classi-fication accuracy of low-performing classes by removing irrele-vant samples and incorporating external data. That is, audio mix-tures are generated from a limited number of data points; thus, even a small number of out-of-class data points could degrade model performance. The experiments demonstrate that the submit-ted systems employing these approaches relatively improve CA-SDRi by up to 14.7% compared to the baseline of DCASE 2025 Challenge Task 4.

Related papers

Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge.<n>We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z)
Confident Learning for Object Detection under Model Constraints [0.05131152350448099]
Agricultural weed detection on edge devices is subject to strict constraints on model capacity, computational resources, and real-time inference latency.<n>This paper proposes Model-Driven Data Correction (MDDC), a data-centric framework that enhances detection performance by iteratively diagnosing and correcting data quality deficiencies.
arXiv Detail & Related papers (2026-01-14T15:32:08Z)
An Enhanced Audio Feature Tailored for Anomalous Sound Detection Based on Pre-trained Models [34.59032968400701]
Anomalous Sound Detection (ASD) aims at identifying anomalous sounds from machines.<n>Uncertainty of anomaly location and much redundant information such as noise in machine sounds hinder the improvement of ASD system performance.<n>This paper proposes a novel audio feature of filter banks with evenly distributed intervals, ensuring equal attention to all frequency ranges in the audio.
arXiv Detail & Related papers (2025-08-21T08:04:08Z)
Exploring the Frontiers of kNN Noisy Feature Detection and Recovery for Self-Driving Labs [0.49478969093606673]
This study develops an automated workflow to detect noisy features, determine sample-feature pairings that can be corrected, and finally recover the correct feature values.<n>A systematic study is then performed to examine how dataset size, noise intensity, and feature value distribution affect both the detectability and recoverability of noisy features.
arXiv Detail & Related papers (2025-07-15T03:35:56Z)
ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning [57.67273340380651]
Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks.<n>These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
arXiv Detail & Related papers (2025-07-03T14:29:43Z)
$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called the Mask-And-Recover (MAR)<n>MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.<n>To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
Self-Adaptive Gamma Context-Aware SSM-based Model for Metal Defect Detection [3.5792989228178897]
Metal defect detection is critical in industrial quality assurance.<n>Existing methods struggle with grayscale variations and complex defect states.<n>This paper proposes a Self-Adaptive Gamma Context-Aware SSM-based model.
arXiv Detail & Related papers (2025-03-03T06:57:54Z)
A Hybrid Framework for Statistical Feature Selection and Image-Based Noise-Defect Detection [55.2480439325792]
This paper presents a hybrid framework that integrates both statistical feature selection and classification techniques to improve defect detection accuracy.<n>We present around 55 distinguished features that are extracted from industrial images, which are then analyzed using statistical methods.<n>By integrating these methods with flexible machine learning applications, the proposed framework improves detection accuracy and reduces false positives and misclassifications.
arXiv Detail & Related papers (2024-12-11T22:12:21Z)
CoTAR: Chain-of-Thought Attribution Reasoning with Multi-level Granularity [8.377398103067508]
We introduce an attribution-oriented Chain-of-Thought reasoning method to enhance the accuracy of attributions.<n> Evaluations on two context-enhanced question-answering datasets using GPT-4 demonstrate improved accuracy and correctness of attributions.
arXiv Detail & Related papers (2024-04-16T12:37:10Z)
MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders. Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
D4AM: A General Denoising Framework for Downstream Acoustic Models [45.04967351760919]
Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems. Existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems. We propose a general denoising framework, D4AM, for various downstream acoustic models.
arXiv Detail & Related papers (2023-11-28T08:27:27Z)
Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD) We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature. We propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z)
Segment-level Metric Learning for Few-shot Bioacoustic Event Detection [56.59107110017436]
We propose a segment-level few-shot learning framework that utilizes both the positive and negative events during model optimization. Our system achieves an F-measure of 62.73 on the DCASE 2022 challenge task 5 (DCASE2022-T5) validation set, outperforming the performance of the baseline prototypical network 34.02 by a large margin.
arXiv Detail & Related papers (2022-07-15T22:41:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.