Joint Speech Activity and Overlap Detection with Multi-Exit Architecture
- URL: http://arxiv.org/abs/2209.11906v1
- Date: Sat, 24 Sep 2022 02:34:11 GMT
- Title: Joint Speech Activity and Overlap Detection with Multi-Exit Architecture
- Authors: Ziqing Du, Kai Liu, Xucheng Wan, Huan Zhou
- Abstract summary: Overlapped speech detection (OSD) is critical for speech applications in multi-party conversation scenarios.
This study investigates the joint VAD and OSD task from a new perspective.
In particular, we propose to extend a traditional classification network with a multi-exit architecture.
- Score: 5.4878772986187565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Overlapped speech detection (OSD) is critical for speech applications in
multi-party conversation scenarios. Despite numerous research efforts and
considerable progress, OSD remains an open challenge compared with speech
activity detection (VAD), and its overall performance is far from satisfactory.
Most prior research formulates OSD as a standard classification problem,
labeling speech at the frame level with a binary label (OSD) or a three-class
label (joint VAD and OSD). In contrast to the mainstream, this study
investigates the joint VAD and OSD task from a new perspective. In particular,
we propose to extend a traditional classification network with a multi-exit
architecture. Such an architecture gives our system the unique capability to
identify the class using either low-level features from early exits or
high-level features from the last exit. In addition, two training schemes,
knowledge distillation and dense connection, are adopted to further boost
system performance. Experimental results on benchmark datasets (AMI and
DIHARD-III) validate the effectiveness and generality of our proposed system.
Our ablations further reveal the complementary contributions of the proposed
schemes. With $F_1$ scores of 0.792 on AMI and 0.625 on DIHARD-III, our proposed
system not only outperforms several top-performing models on these datasets but
also surpasses the current state of the art by large margins on both datasets.
Beyond the performance benefit, our proposed system also offers the potential
for quality-complexity trade-offs, which is highly desirable for efficient OSD
deployment.
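The three-class frame labeling and the idea of reading a prediction from either an early or the last exit can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `frame_labels` and `early_exit_predict`, the label encoding, and in particular the confidence threshold used to decide when to stop at an early exit. The abstract does not state how an exit is selected at inference time.

```python
from typing import List, Sequence, Tuple

# Hypothetical encoding of the three-class label for joint VAD and OSD.
NON_SPEECH, SINGLE_SPEAKER, OVERLAP = 0, 1, 2


def frame_labels(active_speakers: Sequence[int]) -> List[int]:
    """Map the (non-negative) number of active speakers per frame to a label.

    0 speakers -> NON_SPEECH, 1 speaker -> SINGLE_SPEAKER, 2 or more -> OVERLAP.
    """
    return [min(count, OVERLAP) for count in active_speakers]


def early_exit_predict(
    exit_posteriors: Sequence[Sequence[float]], threshold: float = 0.9
) -> Tuple[int, int]:
    """Return (exit_index, label) for one frame.

    exit_posteriors holds one class-probability vector per exit, ordered from
    the earliest exit (low-level features) to the last exit (high-level
    features). The threshold rule is an assumed stopping criterion, chosen
    only to illustrate a quality-complexity trade-off.
    """
    for index, probs in enumerate(exit_posteriors):
        probs = list(probs)
        confidence = max(probs)
        if confidence >= threshold:
            # Confident enough: later exits need not be evaluated.
            return index, probs.index(confidence)
    # No exit reached the threshold: fall back to the last exit.
    last = list(exit_posteriors[-1])
    return len(exit_posteriors) - 1, last.index(max(last))


if __name__ == "__main__":
    print(frame_labels([0, 1, 2, 3]))
    print(early_exit_predict([[0.5, 0.3, 0.2], [0.05, 0.92, 0.03]]))
```

Under this assumed rule, lowering `threshold` lets more frames stop at early exits (less computation), while raising it pushes more frames to the last exit.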
Related papers
- Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems.
To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder.
Results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
- Distilling Aggregated Knowledge for Weakly-Supervised Video Anomaly Detection [11.250490586786878]
Video anomaly detection aims to develop automated models capable of identifying abnormal events in surveillance videos.
We show that distilling knowledge from aggregated representations into a relatively simple model achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-06-05T00:44:42Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains [0.0]
VAD and OSD can be trained jointly using a multi-class classification model.
This paper proposes a complete and new benchmark of different VAD and OSD models.
Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results.
arXiv Detail & Related papers (2023-07-24T14:29:21Z)
- HKNAS: Classification of Hyperspectral Imagery Based on Hyper Kernel Neural Architecture Search [104.45426861115972]
We propose to directly generate structural parameters by utilizing the specifically designed hyper kernels.
We obtain three kinds of networks to separately conduct pixel-level or image-level classifications with 1-D or 3-D convolutions.
A series of experiments on six public datasets demonstrate that the proposed methods achieve state-of-the-art results.
arXiv Detail & Related papers (2023-04-23T17:27:40Z)
- Efficient Person Search: An Anchor-Free Approach [86.45858994806471]
Person search aims to simultaneously localize and identify a query person from realistic, uncropped images.
To achieve this goal, state-of-the-art models typically add a re-id branch upon two-stage detectors like Faster R-CNN.
In this work, we present an anchor-free approach to efficiently tackling this challenging task, by introducing the following dedicated designs.
arXiv Detail & Related papers (2021-09-01T07:01:33Z)
- Disentangle Your Dense Object Detector [82.22771433419727]
Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding.
However, the current training pipeline for dense detectors is compromised to lots of conjunctions that may not hold.
We propose Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into the current state-of-the-art detectors.
arXiv Detail & Related papers (2021-07-07T00:52:16Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
- Improving Point Cloud Semantic Segmentation by Learning 3D Object Detection [102.62963605429508]
Point cloud semantic segmentation plays an essential role in autonomous driving.
Current 3D semantic segmentation networks focus on convolutional architectures that perform great for well represented classes.
We propose a novel Detection Aware 3D Semantic Segmentation (DASS) framework that explicitly leverages localization features from an auxiliary 3D object detection task.
arXiv Detail & Related papers (2020-09-22T14:17:40Z)
- Improving Embedding Extraction for Speaker Verification with Ladder Network [8.843122009658252]
Recent speaker verification (SV) systems rely on deep neural networks to extract high-level embeddings.
We propose to apply the ladder network framework in the SV systems, which combines the supervised and unsupervised learning fashions.
The proposed approach relatively improved the performance by 10% at most without adding parameters and augmented data.
arXiv Detail & Related papers (2020-03-20T07:08:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.