Effects of Word-frequency based Pre- and Post- Processings for Audio
Captioning
- URL: http://arxiv.org/abs/2009.11436v1
- Date: Thu, 24 Sep 2020 01:07:33 GMT
- Title: Effects of Word-frequency based Pre- and Post- Processings for Audio
Captioning
- Authors: Daiki Takeuchi, Yuma Koizumi, Yasunori Ohishi, Noboru Harada, Kunio
Kashino
- Abstract summary: The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely, data augmentation, multi-task learning, and post-processing, for audio captioning.
The system received the highest evaluation scores, but which of the individual elements most fully contributed to its performance has not yet been clarified.
- Score: 49.41766997393417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The system we used for Task 6 (Automated Audio Captioning) of the
Detection and Classification of Acoustic Scenes and Events (DCASE) 2020
Challenge combines three elements, namely, data augmentation, multi-task
learning, and post-processing, for audio captioning. The system received the
highest evaluation scores, but which of the individual elements most fully
contributed to its performance has not yet been clarified. Here, to assess
their contributions, we first conducted an element-wise ablation study on our
system to estimate to what extent each element is effective. We then conducted
a detailed module-wise ablation study to further clarify the key processing
modules for improving accuracy. The results show that data augmentation and
post-processing significantly improve the score in our system. In particular,
mix-up data augmentation and beam search in post-processing improve SPIDEr by
0.8 and 1.6 points, respectively.
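The two modules the abstract credits with the largest gains, mix-up data augmentation and beam-search decoding, can be sketched generically. This is a minimal illustration, not the authors' implementation: the function names, the `alpha` default, and the toy `step_fn` scoring interface are all assumptions introduced here.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mix-up augmentation: blend two training examples and their label
    vectors with a weight drawn from a Beta(alpha, alpha) distribution."""
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

def beam_search(step_fn, start_token, end_token, beam_width=4, max_len=20):
    """Beam-search decoding: step_fn(prefix) returns {token: log_prob} for
    the next token; keep only the beam_width best prefixes at each step."""
    beams = [([start_token], 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_fn(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            # Prefixes ending in end_token are complete captions.
            (finished if prefix[-1] == end_token else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```

In an audio-captioning setting, `mixup` would typically be applied to pairs of log-mel spectrogram inputs and their caption-word targets, while `beam_search` would wrap the caption decoder's next-token distribution instead of picking the single most likely word at each step.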
Related papers
- TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation [19.126525226518975]
We propose a speech separation model with significantly reduced parameters and computational costs.
TIGER leverages prior knowledge to divide frequency bands and compresses frequency information.
We show that TIGER achieves performance surpassing state-of-the-art (SOTA) model TF-GridNet.
arXiv Detail & Related papers (2024-10-02T12:21:06Z)
- Hate Content Detection via Novel Pre-Processing Sequencing and Ensemble Methods [15.647035299476894]
Social media, particularly Twitter, has seen a significant increase in incidents like trolling and hate speech.
This paper introduces a computational framework to curb the hate content on the web.
arXiv Detail & Related papers (2024-09-08T15:32:17Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets [6.617487928813374]
We present a text-to-audio-retrieval system based on pre-trained text and spectrogram transformers.
Our system ranked first in the 2023's DCASE Challenge, and it outperforms the current state of the art on the ClothoV2 benchmark by 5.6 pp. mAP@10.
arXiv Detail & Related papers (2023-08-08T13:46:55Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Deep Feature Learning for Medical Acoustics [78.56998585396421]
The purpose of this paper is to compare different learnables in medical acoustics tasks.
A framework has been implemented to classify human respiratory sounds and heartbeats into two categories, i.e., healthy or affected by pathologies.
arXiv Detail & Related papers (2022-08-05T10:39:37Z)
- Low-complexity deep learning frameworks for acoustic scene classification [64.22762153453175]
We present low-complexity deep learning frameworks for acoustic scene classification (ASC).
The proposed frameworks can be separated into four main steps: Front-end spectrogram extraction, online data augmentation, back-end classification, and late fusion of predicted probabilities.
Our experiments conducted on the DCASE 2022 Task 1 Development dataset fulfilled the low-complexity requirement and achieved a best classification accuracy of 60.1%.
arXiv Detail & Related papers (2022-06-13T11:41:39Z)
- Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study [11.825240267691209]
This paper investigates the potential of transferring high-level voice representations extracted from a public speaker dataset to enrich an acoustic event detection pipeline.
We develop a dual-branch neural network architecture for the joint learning of voice and acoustic features during an AED process.
arXiv Detail & Related papers (2021-10-07T04:03:21Z)
- Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors [51.01295414889487]
We introduce an unsupervised clustering process embedded in the attractor-based end-to-end diarization.
Our method achieved 11.84%, 28.33%, and 19.49% on the CALLHOME, DIHARD II, and DIHARD III datasets, respectively.
arXiv Detail & Related papers (2021-07-04T05:34:21Z)