The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning
with Keywords and Sentence Length Estimation
- URL: http://arxiv.org/abs/2007.00225v1
- Date: Wed, 1 Jul 2020 04:26:27 GMT
- Title: The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning
with Keywords and Sentence Length Estimation
- Authors: Yuma Koizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio
Kashino
- Abstract summary: This report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6.
Our submission focuses on solving two indeterminacy problems in automated audio captioning: word selection indeterminacy and sentence length indeterminacy.
We solve the main caption generation task and the indeterminacy sub-problems simultaneously by estimating keywords and sentence length through multi-task learning.
- Score: 49.41766997393417
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: This technical report describes the system participating in the Detection and
Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6:
automated audio captioning. Our submission focuses on solving two indeterminacy
problems in automated audio captioning: word selection indeterminacy and
sentence length indeterminacy. We solve the main caption generation task and
the indeterminacy sub-problems simultaneously by estimating keywords and sentence
length through multi-task learning. We tested a simplified model of our
submission using the development-testing dataset. Our model achieved a SPIDEr
score of 20.7, whereas the baseline system scored 5.4.
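The multi-task idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not give the model architecture, the loss functions, or the task weights, so everything below is an assumption made for illustration. It assumes the caption loss is a per-word cross-entropy, keyword estimation is a set of independent binary decisions, and sentence length is predicted as a class; the names `multitask_loss`, `w_keyword`, and `w_length` are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the target class."""
    return -math.log(probs[target_index] + EPS)


def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy over independent keyword indicators."""
    total = 0.0
    for p, y in zip(probs, labels):
        total -= y * math.log(p + EPS) + (1 - y) * math.log(1 - p + EPS)
    return total / len(labels)


def multitask_loss(caption_probs, caption_targets,
                   keyword_probs, keyword_labels,
                   length_probs, length_target,
                   w_keyword=1.0, w_length=1.0):
    """Weighted sum of the main caption loss and two auxiliary losses.

    The weights are hypothetical; the source does not state how the
    tasks are balanced.
    """
    # Main task: average cross-entropy over the words of the caption.
    caption_loss = sum(
        cross_entropy(p, t) for p, t in zip(caption_probs, caption_targets)
    ) / len(caption_targets)
    # Auxiliary task 1: which keywords should appear in the caption.
    keyword_loss = binary_cross_entropy(keyword_probs, keyword_labels)
    # Auxiliary task 2: sentence length, treated here as a class index.
    length_loss = cross_entropy(length_probs, length_target)
    return caption_loss + w_keyword * keyword_loss + w_length * length_loss


if __name__ == "__main__":
    loss = multitask_loss(
        caption_probs=[[0.5, 0.5]], caption_targets=[0],
        keyword_probs=[0.5], keyword_labels=[1],
        length_probs=[0.25, 0.75], length_target=1,
    )
    print(f"combined loss: {loss:.4f}")
```

Setting both weights to zero reduces the objective to the caption loss alone, which makes it easy to compare a single-task model against the multi-task variant under otherwise identical conditions.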
Related papers
- Speaker Tagging Correction With Non-Autoregressive Language Models [0.0]
We propose a speaker tagging correction system based on a non-autoregressive language model.
We show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets.
arXiv Detail & Related papers (2024-08-30T11:02:17Z)
- Perception Test 2023: A Summary of the First Challenge And Outcome [67.0525378209708]
The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023.
The goal was to benchmark state-of-the-art video models on the recently proposed Perception Test benchmark.
We summarise in this report the task descriptions, metrics, baselines, and results.
arXiv Detail & Related papers (2023-12-20T15:12:27Z)
- OxfordVGG Submission to the EGO4D AV Transcription Challenge [81.13727731938582]
This report presents the technical details of our submission to the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023 from the OxfordVGG team.
We present WhisperX, a system for efficient speech transcription of long-form audio with word-level time alignment, along with two text normalisers which are publicly available.
Our final submission obtained a Word Error Rate (WER) of 56.2% on the challenge test set and ranked 1st on the leaderboard.
arXiv Detail & Related papers (2023-07-18T06:48:39Z)
- Cross-lingual Alzheimer's Disease detection based on paralinguistic and pre-trained features [6.928826160866143]
We present our submission to the ICASSP-SPGC-2023 ADReSS-M Challenge Task.
This task aims to investigate which acoustic features can be generalized and transferred across languages for Alzheimer's Disease prediction.
We extract paralinguistic features using the openSMILE toolkit and acoustic features using XLSR-53.
Our method achieves an accuracy of 69.6% on the classification task and a root mean squared error (RMSE) of 4.788 on the regression task.
arXiv Detail & Related papers (2023-03-14T06:34:18Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
SLU task benchmarks remain comparatively scarce, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native automated speech scoring (ASS), called speaker-conditioned hierarchical modeling.
In our technique, we take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed them as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
- Auto-KWS 2021 Challenge: Task, Datasets, and Baselines [63.82759886293636]
The Auto-KWS 2021 challenge calls for automated machine learning (AutoML) solutions to automate the process of applying machine learning to a customized keyword spotting task.
The challenge focuses on the problem of customized keyword spotting, where the target device can only be awakened by an enrolled speaker with his specified keyword.
arXiv Detail & Related papers (2021-03-31T14:56:48Z)
- AutoSpeech 2020: The Second Automated Machine Learning Challenge for Speech Classification [31.22181821515342]
The AutoSpeech challenge calls for automated machine learning (AutoML) solutions to automate the process of applying machine learning to speech processing tasks.
This paper outlines the challenge protocol, datasets, evaluation metric, starting kit, and baseline systems.
arXiv Detail & Related papers (2020-10-25T15:01:41Z)
- A Transformer-based Audio Captioning Model with Keyword Estimation [36.507981376481354]
One of the problems with automated audio captioning (AAC) is the indeterminacy in word selection corresponding to the audio event/scene.
We propose a Transformer-based audio-captioning model with keyword estimation called TRACKE.
It simultaneously solves the word-selection indeterminacy problem with the main task of AAC while executing the sub-task of acoustic event detection/acoustic scene classification.
arXiv Detail & Related papers (2020-07-01T04:21:00Z)