AISTAT lab system for DCASE2025 Task6: Language-based audio retrieval
- URL: http://arxiv.org/abs/2509.16649v1
- Date: Sat, 20 Sep 2025 11:53:18 GMT
- Title: AISTAT lab system for DCASE2025 Task6: Language-based audio retrieval
- Authors: Hyun Jun Kim, Hyeong Yong Choi, Changwon Lim
- Abstract summary: This report presents the AISTAT team's submission to the language-based audio retrieval task in DCASE 2025 Task 6. Our proposed system employs a dual-encoder architecture, where audio and text modalities are encoded separately and their representations are aligned using contrastive learning. Our best single system achieved a mAP@16 of 46.62, while an ensemble of four systems reached a mAP@16 of 48.83 on the Clotho development test split.
- Score: 11.868064182311462
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report presents the AISTAT team's submission to the language-based audio retrieval task in DCASE 2025 Task 6. Our proposed system employs a dual-encoder architecture, where audio and text modalities are encoded separately and their representations are aligned using contrastive learning. Drawing inspiration from methodologies of the previous year's challenge, we implemented a distillation approach and leveraged large language models (LLMs) for effective data augmentation techniques, including back-translation and LLM mix. Additionally, we incorporated clustering to introduce an auxiliary classification task for further finetuning. Our best single system achieved a mAP@16 of 46.62, while an ensemble of four systems reached a mAP@16 of 48.83 on the Clotho development test split.
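Dual-encoder alignment of the kind described in the abstract is typically trained with a symmetric contrastive (InfoNCE) objective over batched audio-text pairs. The sketch below is a minimal NumPy illustration under that assumption; the temperature value and the embeddings themselves are hypothetical placeholders, not the authors' actual configuration.

```python
import numpy as np

def symmetric_infonce(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    audio/text embeddings, pulling matching pairs together and pushing
    mismatched pairs apart."""
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature      # (B, B) similarity matrix
    labels = np.arange(len(a))          # matching pairs lie on the diagonal

    def xent(l):
        # Cross-entropy of each row against its diagonal target.
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the audio->text and text->audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned pairs (identical audio and text embeddings) yield a loss near zero, while unrelated embeddings yield a substantially higher loss, which is what drives the two encoders toward a shared retrieval space.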
Related papers
- DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment [94.0709779805955]
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM). It is designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks.
arXiv Detail & Related papers (2025-07-03T16:28:25Z)
- NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 [24.056321452209666]
This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I). We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies.
arXiv Detail & Related papers (2025-06-16T10:28:27Z)
- Task Arithmetic for Language Expansion in Speech Translation [41.721843322787045]
We aim to build a one-to-many ST system from existing one-to-one ST systems using task arithmetic, without re-training. Experiments on MuST-C and CoVoST-2 show BLEU score improvements of up to 4.66 and 4.92, with COMET gains of 8.87 and 11.83.
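Task arithmetic, as used above to expand a speech translation system without retraining, composes models by adding "task vectors" (fine-tuned weights minus base weights) onto a shared base model. The sketch below illustrates the idea on toy one-parameter weight dictionaries; the model names and values are hypothetical, not drawn from the paper.

```python
import numpy as np

def task_vector(finetuned, base):
    """Task vector: element-wise difference between fine-tuned and base weights."""
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vectors(base, vectors, scale=1.0):
    """Merge several task vectors into the base model without any retraining."""
    merged = {k: v.copy() for k, v in base.items()}
    for vec in vectors:
        for k in merged:
            merged[k] += scale * vec[k]
    return merged

# Hypothetical one-tensor "models" for two translation directions.
base = {"w": np.array([1.0, 2.0])}
en_de = {"w": np.array([1.5, 2.0])}   # fine-tuned for one direction
en_fr = {"w": np.array([1.0, 2.5])}   # fine-tuned for another
merged = apply_task_vectors(
    base, [task_vector(en_de, base), task_vector(en_fr, base)]
)
# merged["w"] -> [1.5, 2.5]: both task deltas applied to the base weights
```

The `scale` factor is the usual knob for trading off how strongly each task's behavior is injected into the merged model.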
arXiv Detail & Related papers (2024-09-17T15:25:11Z)
- OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [88.59397418187226]
We propose a novel unified open-vocabulary detection method called OV-DINO.
It is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework.
We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks.
arXiv Detail & Related papers (2024-07-10T17:05:49Z)
- Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9 [4.328586290529485]
We present a prompt-engineering-based text-augmentation approach applied to a language-queried audio source separation (LASS) task. To enhance the performance of LASS, the proposed approach utilizes large language models (LLMs) to generate multiple captions corresponding to each sentence of the training dataset.
arXiv Detail & Related papers (2024-06-17T06:19:14Z)
- SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection [1.3886978730184498]
The SHROOM-INDElab system builds on previous work on using prompt programming and in-context learning to build classifiers for hallucination detection.
It extends that work through the incorporation of context-specific definition of task, role, and target concept, and automated generation of examples for use in a few-shot prompting approach.
The resulting system achieved fourth-best and sixth-best performance in the model-agnostic and model-aware tracks for Task 6.
arXiv Detail & Related papers (2024-04-04T18:01:21Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
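A caption-paraphrasing pipeline of the kind described above boils down to assembling a prompt that constrains the LLM with the extracted clues before sending it to the model. The template below is a hypothetical minimal illustration of that prompt-construction step, not Auto-ACD's actual prompt; the clue names and wording are invented for the example.

```python
def build_paraphrase_prompt(caption, clues):
    """Assemble a clue-constrained paraphrasing prompt for an LLM.

    The template text is a hypothetical sketch: the real system's prompt
    wording and clue extraction are not specified in the summary above.
    """
    clue_text = ", ".join(clues)
    return (
        "Rewrite the following audio caption so it stays factually "
        f"consistent with these clues: {clue_text}.\n"
        f"Caption: {caption}\n"
        "Paraphrase:"
    )

prompt = build_paraphrase_prompt(
    "a dog barks near a busy road",
    ["dog bark", "traffic noise", "outdoor"],
)
```

The returned string would then be passed to whichever LLM the pipeline uses; keeping the clues inside the prompt is what steers the paraphrase toward captions that remain congruent with the audio.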
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Transsion TSUP's speech recognition system for ASRU 2023 MADASR Challenge [11.263392524468625]
The system focuses on adapting ASR models for low-resource Indian languages.
The proposed method achieved word error rates (WER) of 24.17%, 24.43%, 15.97%, and 15.97% for Bengali language in the four tracks, and WER of 19.61%, 19.54%, 15.48%, and 15.48% for Bhojpuri language in the four tracks.
arXiv Detail & Related papers (2023-07-20T00:55:01Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- ESPnet-ST IWSLT 2021 Offline Speech Translation System [56.83606198051871]
This paper describes the ESPnet-ST group's IWSLT 2021 submission in the offline speech translation track.
This year we made various efforts on training data, architecture, and audio segmentation.
Our best E2E system combined all the techniques with model ensembling and achieved 31.4 BLEU.
arXiv Detail & Related papers (2021-07-01T17:49:43Z)
- CAiRE in DialDoc21: Data Augmentation for Information-Seeking Dialogue System [55.43871578056878]
In DialDoc21 competition, our system achieved 74.95 F1 score and 60.74 Exact Match score in subtask 1, and 37.72 SacreBLEU score in subtask 2.
arXiv Detail & Related papers (2021-06-07T11:40:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.