STHG: Spatial-Temporal Heterogeneous Graph Learning for Advanced
Audio-Visual Diarization
- URL: http://arxiv.org/abs/2306.10608v4
- Date: Tue, 31 Oct 2023 18:34:12 GMT
- Title: STHG: Spatial-Temporal Heterogeneous Graph Learning for Advanced
Audio-Visual Diarization
- Authors: Kyle Min
- Abstract summary: This report introduces our novel method named STHG for the Audio-Visual Diarization task of the Ego4D Challenge 2023.
Our key innovation is that we model all the speakers in a video using a single, unified heterogeneous graph learning framework.
Our final method obtains 61.1% DER on the test set of Ego4D, which significantly outperforms all the baselines as well as last year's winner.
- Score: 3.9886149789339327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report introduces our novel method named STHG for the Audio-Visual
Diarization task of the Ego4D Challenge 2023. Our key innovation is that we
model all the speakers in a video using a single, unified heterogeneous graph
learning framework. Unlike previous approaches that require a separate
component solely for the camera wearer, STHG can jointly detect the speech
activities of all people including the camera wearer. Our final method obtains
61.1% DER on the test set of Ego4D, which significantly outperforms all the
baselines as well as last year's winner. Our submission achieved 1st place in
the Ego4D Challenge 2023. We additionally demonstrate that applying an
off-the-shelf speech recognition system to the speech segments diarized by STHG
yields competitive performance on the Speech Transcription task of this
challenge.
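For reference, the diarization error rate (DER) quoted above is the standard metric for this task: the total duration of missed speech, false-alarm speech, and speaker confusion, divided by the total duration of reference speech. The snippet below is a minimal illustration of that definition, not code from the paper; it assumes the error terms have already been aggregated into durations in seconds, whereas real scoring tools (e.g. pyannote.metrics) derive them from segment files with collar and overlap handling.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total reference speech.

    All arguments are durations in seconds, pre-aggregated over the recording.
    """
    return (missed + false_alarm + confusion) / total_speech

# Toy example: 600 s of reference speech with 40 s missed, 25 s false alarm,
# and 30 s attributed to the wrong speaker -> DER of about 15.8%.
print(f"DER = {diarization_error_rate(40.0, 25.0, 30.0, 600.0):.1%}")
```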
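The abstract also mentions running an off-the-shelf speech recognizer over the segments diarized by STHG for the Speech Transcription task. The report does not specify which recognizer is used; the sketch below uses Whisper purely as an illustrative stand-in, with a hypothetical audio path and hypothetical (speaker, start, end) diarization output.

```python
# Hedged sketch: transcribing diarized segments with an off-the-shelf ASR model.
# Whisper is an assumption for illustration; the report does not name the system.
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
audio = whisper.load_audio("ego4d_clip.wav")   # hypothetical file path
sr = whisper.audio.SAMPLE_RATE                 # Whisper expects 16 kHz mono audio

# Hypothetical diarization output: (speaker_id, start_sec, end_sec).
diarized_segments = [("camera_wearer", 0.0, 3.2), ("speaker_1", 3.4, 7.9)]

for speaker, start, end in diarized_segments:
    clip = audio[int(start * sr):int(end * sr)]
    text = model.transcribe(clip, fp16=False)["text"].strip()
    print(f"[{speaker} {start:.1f}-{end:.1f}s] {text}")
```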
Related papers
- Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024 [8.940008511570207]
The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices.
The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task.
This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER).
arXiv Detail & Related papers (2024-09-03T21:28:45Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The advantage of the video input is consistently demonstrated in the mask-based MVDR speech separation and in the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end.
Experiments were conducted on overlapped and reverberant speech mixtures constructed by simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- A Study on the Integration of Pipeline and E2E SLU systems for Spoken Semantic Parsing toward STOP Quality Challenge [33.89616011003973]
We describe our proposed spoken semantic parsing system for the quality track (Track 1) in Spoken Language Understanding Grand Challenge.
Strong automatic speech recognition (ASR) models like Whisper and pretrained language models (LMs) like BART are utilized inside our SLU framework to boost performance.
We also investigate the output-level combination of various models to reach an exact match accuracy of 80.8, which won 1st place in the challenge.
arXiv Detail & Related papers (2023-05-02T17:25:19Z)
- ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement [40.29155338515071]
ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis.
It achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model.
arXiv Detail & Related papers (2022-12-21T21:36:52Z)
- AVATAR submission to the Ego4D AV Transcription Challenge [79.21857972093332]
Our pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images.
Our final method achieves a WER of 68.40 on the challenge test set, outperforming the baseline by 43.7%, and winning the challenge.
arXiv Detail & Related papers (2022-11-18T01:03:30Z)
- Intel Labs at Ego4D Challenge 2022: A Better Baseline for Audio-Visual Diarization [3.9886149789339327]
This report describes our approach for the Audio-Visual Diarization (AVD) task of the Ego4D Challenge 2022.
First, we improve the detection performance of the camera wearer's voice activity by modifying the training scheme of its model.
Second, we discover that an off-the-shelf voice activity detection model can effectively remove false positives when it is applied solely to the camera wearer's voice activities (a schematic of this filtering step is sketched after this list).
arXiv Detail & Related papers (2022-10-14T12:54:03Z)
- AVATAR: Unconstrained Audiovisual Speech Recognition [75.17253531162608]
We propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) trained end-to-end from spectrograms and full-frame RGB.
We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise.
We also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
arXiv Detail & Related papers (2022-06-15T17:33:19Z)
- Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data-intensive deep neural network (DNN) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z)
- MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound [90.1857707251566]
We introduce MERLOT Reserve, a model that represents videos jointly over time.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
arXiv Detail & Related papers (2022-01-07T19:00:21Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM builds on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
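As mentioned in the Intel Labs baseline entry above, one practical refinement is to keep a camera-wearer voice-activity prediction only where an off-the-shelf voice activity detector (VAD) also fires. The sketch below illustrates that filtering idea under assumed data structures (segments as (start, end) pairs in seconds); it is not the authors' implementation, and the coverage threshold is an arbitrary illustrative choice.

```python
def overlap(a, b):
    """Temporal overlap in seconds between two (start, end) segments."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def filter_with_vad(pred_segments, vad_segments, min_ratio=0.5):
    """Keep a predicted voice-activity segment only if off-the-shelf VAD output
    covers at least `min_ratio` of its duration (threshold is illustrative)."""
    kept = []
    for seg in pred_segments:
        covered = sum(overlap(seg, v) for v in vad_segments)
        if covered / (seg[1] - seg[0]) >= min_ratio:
            kept.append(seg)
    return kept

# Toy example: the second predicted segment has no VAD support and is dropped.
print(filter_with_vad([(0.0, 2.0), (5.0, 6.0)], [(0.1, 2.2)]))
```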