"This is Houston. Say again, please". The Behavox system for the
Apollo-11 Fearless Steps Challenge (phase II)
- URL: http://arxiv.org/abs/2008.01504v1
- Date: Tue, 4 Aug 2020 13:18:28 GMT
- Title: "This is Houston. Say again, please". The Behavox system for the
Apollo-11 Fearless Steps Challenge (phase II)
- Authors: Arseniy Gorin, Daniil Kulko, Steven Grima, Alex Glasman
- Abstract summary: We describe the speech activity detection (SAD), speaker diarization (SD), and automatic speech recognition (ASR) experiments conducted by the Behavox team for the Interspeech 2020 Fearless Steps Challenge (FSC-2)
A relatively small amount of labeled data, a large variety of speakers and channel distortions, specific lexicon and speaking style resulted in high error rates on the systems which involved this data.
For all systems, we report substantial performance improvements compared to the FSC-2 baseline systems.
- Score: 3.3263205689999453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We describe the speech activity detection (SAD), speaker diarization (SD),
and automatic speech recognition (ASR) experiments conducted by the Behavox
team for the Interspeech 2020 Fearless Steps Challenge (FSC-2). A relatively
small amount of labeled data, a large variety of speakers and channel
distortions, and a domain-specific lexicon and speaking style resulted in high
error rates for systems trained on this data. In addition to approximately 36 hours
of annotated NASA mission recordings, the organizers provided a much larger but
unlabeled 19k hour Apollo-11 corpus that we also explore for semi-supervised
training of ASR acoustic and language models, observing more than 17% relative
word error rate improvement compared to training on the FSC-2 data only. We
also compare several SAD and SD systems to approach the most difficult tracks
of the challenge (track 1 for diarization and ASR), where long 30-minute audio
recordings are provided for evaluation without segmentation or speaker
information. For all systems, we report substantial performance improvements
compared to the FSC-2 baseline systems; our submissions ranked first for SD and
ASR and fourth for SAD in the challenge.
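The "more than 17% relative word error rate improvement" quoted in the abstract uses the standard relative-reduction convention: the WER drop expressed as a fraction of the baseline WER. A minimal sketch, using hypothetical WER values (the paper's actual numbers are not given here):

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER improvement, as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Hypothetical illustration: a baseline of 30.0% WER reduced to 24.6%
# corresponds to an 18% relative improvement, i.e. a gain "more than 17%
# relative" in the sense the abstract uses the term.
improvement = relative_wer_improvement(30.0, 24.6)
print(f"{improvement:.1f}% relative improvement")
```

Note that a "17% relative" gain is much smaller than a 17-point absolute drop; relative reduction is the usual convention when comparing ASR systems across datasets with different baseline error rates.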
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-regularized spectral basis embedding (SBE) features exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
Feature-based learning hidden unit contributions (f-LHUC) are conditioned on VR-LH features, which are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z) - AG-LSEC: Audio Grounded Lexical Speaker Error Correction [9.54540722574194]
Speaker Diarization (SD) systems are typically audio-based and operate independently of the ASR system in traditional speech transcription pipelines.
We propose to enhance and acoustically ground the Lexical Speaker Error Correction (LSEC) system with speaker scores directly derived from the existing SD pipeline.
This approach achieves significant relative WDER reductions in the range of 25-40% over the audio-based SD + ASR system, and beats the LSEC system by a relative 15-25% on the RT03-CTS, Callhome American English, and Fisher datasets.
arXiv Detail & Related papers (2024-06-25T04:20:49Z) - The Second DISPLACE Challenge : DIarization of SPeaker and LAnguage in Conversational Environments [28.460119283649913]
The dataset contains 158 hours of speech, consisting of both supervised and unsupervised mono-channel far-field recordings.
12 hours of close-field mono-channel recordings were provided for the ASR track conducted on 5 Indian languages.
We compare our baseline models and the teams' performances on the DISPLACE-2023 evaluation data to highlight the advancements made in this second version of the challenge.
arXiv Detail & Related papers (2024-06-13T17:32:32Z) - MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and SD solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z) - Investigation of Data Augmentation Techniques for Disordered Speech
Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z) - CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command
Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR)
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z) - BERTraffic: A Robust BERT-Based Approach for Speaker Change Detection
and Role Identification of Air-Traffic Communications [2.270534915073284]
When the Speech Activity Detection (SAD) or diarization system fails, two or more single-speaker segments end up in the same recording.
We developed a system that combines the segmentation of a SAD module with a BERT-based model that performs Speaker Change Detection (SCD) and Speaker Role Identification (SRI) based on ASR transcripts (i.e., diarization + SRI)
The proposed model reaches up to 0.90/0.95 F1-score on ATCO/pilot for SRI on several test sets.
arXiv Detail & Related papers (2021-10-12T07:25:12Z) - Scenario Aware Speech Recognition: Advancements for Apollo Fearless
Steps & CHiME-4 Corpora [70.46867541361982]
We consider a general non-semantic speech representation, trained with a self-supervised triplet-loss criterion, called TRILL.
We observe +5.42% and +3.18% relative WER improvement for the development and evaluation sets of Fearless Steps.
arXiv Detail & Related papers (2021-09-23T00:43:32Z) - EML Online Speech Activity Detection for the Fearless Steps Challenge
Phase-III [7.047338765733677]
This paper describes the online algorithm for the most recent phase of the Fearless Steps challenge.
The proposed algorithm can be trained both in a supervised and unsupervised manner.
Experiments show a competitive accuracy on both development and evaluation datasets with a real-time factor of about 0.002 using a single CPU machine.
arXiv Detail & Related papers (2021-06-21T12:55:51Z) - Automatic Speech Recognition Benchmark for Air-Traffic Communications [1.175956452196938]
CleanSky EC-H2020 ATCO2 aims to develop an ASR-based platform to collect, organize, and automatically pre-process ATCO speech data from the airspace.
Accent-related errors are minimized by the sheer amount of data, making the system feasible for ATC environments.
arXiv Detail & Related papers (2020-06-18T06:49:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.