OxfordVGG Submission to the EGO4D AV Transcription Challenge
- URL: http://arxiv.org/abs/2307.09006v1
- Date: Tue, 18 Jul 2023 06:48:39 GMT
- Title: OxfordVGG Submission to the EGO4D AV Transcription Challenge
- Authors: Jaesung Huh, Max Bain and Andrew Zisserman
- Abstract summary: This report presents the technical details of our submission to the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023 from the OxfordVGG team.
We present WhisperX, a system for efficient speech transcription of long-form audio with word-level time alignment, along with two publicly available text normalisers.
Our final submission obtained a Word Error Rate (WER) of 56.0% on the challenge test set, ranking 1st on the leaderboard.
- Score: 81.13727731938582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report presents the technical details of our submission to the EGO4D
Audio-Visual (AV) Automatic Speech Recognition Challenge 2023 from the
OxfordVGG team. We present WhisperX, a system for efficient speech
transcription of long-form audio with word-level time alignment, along with two
publicly available text normalisers. Our final submission obtained a Word Error
Rate (WER) of 56.0% on the challenge test set, ranking 1st on the leaderboard.
All baseline code and models are available at
https://github.com/m-bain/whisperX.
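For context, the sketch below illustrates how a WhisperX-style pipeline is typically invoked to obtain a long-form transcript with word-level timestamps, following the interface documented in the repository above. The model size, batch size, compute type, and file path are illustrative assumptions rather than the challenge configuration, and argument names may differ between library versions.

```python
# Minimal sketch (not the authors' exact setup): transcribe long-form audio and
# align it to word-level timestamps with the public whisperx package.
import whisperx

device = "cuda"             # or "cpu"
audio_file = "example.wav"  # hypothetical input path

# 1. Load a Whisper model and transcribe the audio in batched chunks.
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# 2. Align the transcript with a language-specific alignment model to recover
#    per-word start and end times.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Each aligned segment now carries word-level timing information.
for segment in aligned["segments"]:
    for word in segment.get("words", []):
        print(word.get("word"), word.get("start"), word.get("end"))
```

In a typical evaluation, hypotheses and references are passed through text normalisers such as those mentioned above before scoring, so that differences in casing, punctuation, or number formatting are not counted as errors. Recall that WER = (substitutions + deletions + insertions) / reference words, so the 56.0% figure corresponds to roughly 56 word errors per 100 reference words.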
Related papers
- Distilling an End-to-End Voice Assistant Without Instruction Training Data [53.524071162124464]
Distilled Voice Assistant (DiVA) generalizes to Question Answering, Classification, and Translation.
We show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio.
arXiv Detail & Related papers (2024-10-03T17:04:48Z) - The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge [20.903716738950468]
We describe the systems developed by the SJTU X-LANCE group for the TTS (acoustic + vocoder), SVS, and ASR tracks in the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge.
Notably, we achieved 1st rank on the leaderboard in the TTS track both with the whole training set and with only 1 h of training data, with the highest UTMOS score and lowest among all submissions.
arXiv Detail & Related papers (2024-04-09T07:37:41Z) - ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus [7.97238074132292]
ÌròyìnSpeech is a new corpus created to increase the amount of high-quality, contemporary Yorùbá speech data.
We curated about 23,000 text sentences from the news and creative writing domains, released under the open CC-BY-4.0 license.
arXiv Detail & Related papers (2023-07-29T20:42:50Z) - STHG: Spatial-Temporal Heterogeneous Graph Learning for Advanced
Audio-Visual Diarization [3.9886149789339327]
This report introduces our novel method named STHG for the Audio-Visual Diarization task of the Ego4D Challenge 2023.
Our key innovation is that we model all the speakers in a video using a single, unified heterogeneous graph learning framework.
Our final method obtains 61.1% DER on the test set of Ego4D, which significantly outperforms all the baselines as well as last year's winner.
arXiv Detail & Related papers (2023-06-18T17:55:02Z) - AVATAR submission to the Ego4D AV Transcription Challenge [79.21857972093332]
Our pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images.
Our final method achieves a WER of 68.40% on the challenge test set, outperforming the baseline by 43.7% and winning the challenge.
arXiv Detail & Related papers (2022-11-18T01:03:30Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data from the target speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
arXiv Detail & Related papers (2021-05-24T04:10:47Z) - The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning
with Keywords and Sentence Length Estimation [49.41766997393417]
This report describes the system participating to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6.
Our submission focuses on solving two indeterminacy problems in automated audio captioning: word selection indeterminacy and sentence length indeterminacy.
We simultaneously solve the main caption generation task and these indeterminacy sub-problems by estimating keywords and sentence length through multi-task learning.
arXiv Detail & Related papers (2020-07-01T04:26:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.