UniCon+: ICTCAS-UCAS Submission to the AVA-ActiveSpeaker Task at
ActivityNet Challenge 2022
- URL: http://arxiv.org/abs/2206.10861v1
- Date: Wed, 22 Jun 2022 06:11:07 GMT
- Title: UniCon+: ICTCAS-UCAS Submission to the AVA-ActiveSpeaker Task at
ActivityNet Challenge 2022
- Authors: Yuanhang Zhang, Susan Liang, Shuang Yang, Shiguang Shan
- Abstract summary: This report presents a brief description of our winning solution to the AVA Active Speaker Detection (ASD) task at ActivityNet Challenge 2022.
Our underlying model UniCon+ continues to build on our previous work, the Unified Context Network (UniCon) and Extended UniCon.
We augment the architecture with a simple GRU-based module that allows information of recurring identities to flow across scenes.
- Score: 69.67841335302576
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This report presents a brief description of our winning solution to the AVA
Active Speaker Detection (ASD) task at ActivityNet Challenge 2022. Our
underlying model UniCon+ continues to build on our previous work, the Unified
Context Network (UniCon) and Extended UniCon which are designed for robust
scene-level ASD. We augment the architecture with a simple GRU-based module
that allows information of recurring identities to flow across scenes through
read and update operations. We report a best result of 94.47% mAP on the
AVA-ActiveSpeaker test set, which continues to rank first on this year's
challenge leaderboard and significantly pushes the state-of-the-art.
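The cross-scene identity mechanism described in the abstract (a GRU-based module with read and update operations over recurring identities) can be sketched roughly as follows. This is a hypothetical minimal re-implementation, not the authors' code: the class name, shapes, and gate convention are all assumptions.

```python
import numpy as np


class IdentityMemory:
    """Toy GRU-based cross-scene identity memory.

    A hypothetical re-implementation of the idea in the abstract: each
    recurring identity keeps a hidden state that is *read* to condition
    scene-level features and *updated* through GRU gating as new scenes
    arrive. All names, shapes, and the gate convention are assumptions,
    not the authors' actual interface.
    """

    def __init__(self, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Stacked input/recurrent weights for the update (z), reset (r)
        # and candidate gates of a single GRU cell.
        self.W = rng.normal(scale=0.1, size=(3, dim, dim))
        self.U = rng.normal(scale=0.1, size=(3, dim, dim))
        self.memory: dict = {}  # identity -> hidden state
        self.dim = dim

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def read(self, identity: str) -> np.ndarray:
        # Zero state for identities that have not appeared before.
        return self.memory.get(identity, np.zeros(self.dim))

    def update(self, identity: str, x: np.ndarray) -> np.ndarray:
        h = self.read(identity)
        z = self._sigmoid(self.W[0] @ x + self.U[0] @ h)       # update gate
        r = self._sigmoid(self.W[1] @ x + self.U[1] @ h)       # reset gate
        h_cand = np.tanh(self.W[2] @ x + self.U[2] @ (r * h))  # candidate
        h_new = (1.0 - z) * h + z * h_cand                     # gated blend
        self.memory[identity] = h_new
        return h_new
```

In such a scheme, a face track belonging to a known speaker would read its stored state before scene-level scoring and update it afterwards, letting identity evidence persist across scene boundaries.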
Related papers
- Harnessing Temporal Causality for Advanced Temporal Action Detection [53.654457142657236]
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
arXiv Detail & Related papers (2024-07-25T06:03:02Z)
- Towards Attention-based Contrastive Learning for Audio Spoof Detection [3.08086566663567]
Vision transformers (ViT) have made substantial progress for classification tasks in computer vision.
We introduce ViTs for audio spoof detection task.
We propose a novel attention-based contrastive learning framework (SSAST-CL) that uses cross-attention to aid the representation learning.
arXiv Detail & Related papers (2024-07-03T21:25:12Z)
- Perception Test 2023: A Summary of the First Challenge And Outcome [67.0525378209708]
The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023.
The goal was to benchmark state-of-the-art video models on the recently proposed Perception Test benchmark.
We summarise in this report the task descriptions, metrics, baselines, and results.
arXiv Detail & Related papers (2023-12-20T15:12:27Z)
- TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning [15.673602262069531]
Active Speaker Detection (ASD) is a task to determine whether a person is speaking or not in a series of video frames.
We propose TalkNCE, a novel talk-aware contrastive loss.
Our method achieves state-of-the-art performances on AVA-ActiveSpeaker and ASW datasets.
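The summary does not specify the TalkNCE loss itself; as a rough illustration, a generic InfoNCE-style contrastive loss over aligned audio-visual frame embeddings, the kind of objective such talk-aware losses build on, could look like the sketch below. The function name and interface are assumptions for illustration only.

```python
import numpy as np


def info_nce(audio, visual, temperature=0.1):
    """Generic InfoNCE-style contrastive loss over aligned embeddings.

    audio, visual: (N, D) arrays of L2-normalised embeddings, where
    matching rows form positive pairs and all other rows in the batch
    act as negatives. Illustrative only: the actual TalkNCE loss is
    talk-aware (tied to active-speech segments), which this generic
    form does not capture.
    """
    logits = (audio @ visual.T) / temperature       # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives lie on the diagonal; loss is their mean negative log-likelihood.
    return -np.mean(np.diag(log_probs))
```

Minimising this pulls matching audio-visual pairs together and pushes mismatched pairs apart within each batch.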
arXiv Detail & Related papers (2023-09-21T17:59:11Z)
- A Study on the Integration of Pipeline and E2E SLU systems for Spoken Semantic Parsing toward STOP Quality Challenge [33.89616011003973]
We describe our proposed spoken semantic parsing system for the quality track (Track 1) in Spoken Language Understanding Grand Challenge.
Strong automatic speech recognition (ASR) models like Whisper and pretrained Language models (LM) like BART are utilized inside our SLU framework to boost performance.
We also investigate the output level combination of various models to get an exact match accuracy of 80.8, which won the 1st place at the challenge.
arXiv Detail & Related papers (2023-05-02T17:25:19Z)
- Robust, General, and Low Complexity Acoustic Scene Classification Systems and An Effective Visualization for Presenting a Sound Scene Context [53.80051967863102]
We present a comprehensive analysis of Acoustic Scene Classification (ASC).
We propose an inception-based and low footprint ASC model, referred to as the ASC baseline.
Next, we improve the ASC baseline by proposing a novel deep neural network architecture.
arXiv Detail & Related papers (2022-10-16T19:07:21Z)
- Tongji University Undergraduate Team for the VoxCeleb Speaker Recognition Challenge 2020 [10.836635938778684]
We applied the RSBU-CW module to the ResNet34 framework to improve the denoising ability of the network.
We trained two variants of ResNet, and used score fusion and data-augmentation methods to improve the performance of the model.
arXiv Detail & Related papers (2020-10-20T09:25:40Z)
- 1st place solution for AVA-Kinetics Crossover in ActivityNet Challenge 2020 [43.81722332148899]
This report introduces our winning solution to the action-temporal localization track, AVA-Kinetics, in ActivityNet Challenge 2020.
We describe technical details for the new AVA-Kinetics dataset, together with some experimental results.
Without any bells and whistles, we achieved 39.62 mAP on the test set of AVA-Kinetics, which outperforms other entries by a large margin.
arXiv Detail & Related papers (2020-06-16T12:52:59Z)
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
- Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module, that contaminates the input signals with a variety of random disturbances.
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)
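The online speech-distortion idea mentioned for PASE+ above can be illustrated for the additive-noise case alone; the actual module applies a wider variety of random disturbances (this function name, interface, and the SNR convention are assumptions for illustration):

```python
import numpy as np


def distort(wave, rng, snr_db=10.0):
    """Toy online distortion step: contaminate a waveform with random
    additive noise at a target signal-to-noise ratio (in dB).

    Illustrative sketch only; a full module in the spirit of PASE+
    would also draw reverberation, clipping, band-drop, etc.
    """
    noise = rng.normal(size=wave.shape)
    sig_pow = np.mean(wave ** 2)
    noise_pow = np.mean(noise ** 2)
    # Scale noise so 10*log10(sig_pow / scaled_noise_pow) == snr_db.
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    return wave + scale * noise
```

Applying a fresh random draw per training example yields a different corrupted view of each utterance at every epoch.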
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.