TCG CREST System Description for the Second DISPLACE Challenge
- URL: http://arxiv.org/abs/2409.15356v1
- Date: Mon, 16 Sep 2024 05:13:34 GMT
- Title: TCG CREST System Description for the Second DISPLACE Challenge
- Authors: Nikhil Raghav, Subhajit Saha, Md Sahidullah, Swagatam Das
- Abstract summary: We describe the speaker diarization (SD) and language diarization (LD) systems developed by our team for the Second DISPLACE Challenge, 2024.
Our contributions were dedicated to Track 1 for SD and Track 2 for LD in multilingual and multi-speaker scenarios.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this report, we describe the speaker diarization (SD) and language diarization (LD) systems developed by our team for the Second DISPLACE Challenge, 2024. Our contributions were dedicated to Track 1 for SD and Track 2 for LD in multilingual and multi-speaker scenarios. We investigated different speech enhancement techniques, voice activity detection (VAD) techniques, unsupervised domain categorization, and neural embedding extraction architectures. We also exploited the fusion of various embedding extraction models. We implemented our system with the open-source SpeechBrain toolkit. Our final submissions use spectral clustering for both speaker and language diarization. We achieve about a $7\%$ relative improvement over the challenge baseline in Track 1. We did not obtain an improvement over the challenge baseline in Track 2.
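The final submissions cluster neural embeddings with spectral clustering. As an illustration only, the sketch below applies cosine-affinity spectral clustering (via scikit-learn, not the SpeechBrain recipe used in the actual system) to synthetic segment embeddings; all values, dimensions, and cluster counts are invented for the example.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
# Synthetic segment embeddings: two "speakers" as Gaussian clusters in 8-D.
spk_a = rng.normal(loc=0.0, scale=0.1, size=(10, 8)) + 1.0
spk_b = rng.normal(loc=0.0, scale=0.1, size=(10, 8)) - 1.0
X = np.vstack([spk_a, spk_b])

# Cosine affinity, as commonly used with speaker embeddings (x-vectors etc.).
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
affinity = np.clip(X_norm @ X_norm.T, 0.0, 1.0)  # keep affinities non-negative

# Cluster segments into speakers from the precomputed affinity matrix.
labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(affinity)
```

In a real diarization pipeline the number of clusters is usually estimated (e.g. from the eigenvalue gap of the affinity's Laplacian) rather than fixed in advance.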
Related papers
- The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments
The dataset contains 158 hours of speech, consisting of both supervised and unsupervised mono-channel far-field recordings.
12 hours of close-field mono-channel recordings were provided for the ASR track, conducted in 5 Indian languages.
We have compared our baseline models and the team's performances on evaluation data of DISPLACE-2023 to emphasize the advancements made in this second version of the challenge.
arXiv Detail & Related papers (2024-06-13T17:32:32Z)
- Summary of the DISPLACE Challenge 2023 -- DIarization of SPeaker and LAnguage in Conversational Environments
In multilingual societies, where multiple languages are spoken within a small geographic vicinity, informal conversations often involve a mix of languages.
Existing speech technologies may be inefficient in extracting information from such conversations, where the speech data is rich in diversity with multiple languages and speakers.
The DISPLACE challenge constitutes an open call for evaluating and benchmarking speaker and language diarization technologies under this challenging condition.
arXiv Detail & Related papers (2023-11-21T12:23:58Z)
- Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech Systems for the MADASR 2023 Challenge
This paper describes Tallinn University of Technology (TalTech) systems developed for the ASRU MADASR 2023 Challenge.
The challenge focuses on automatic speech recognition of dialect-rich Indian languages with limited training audio and text data.
TalTech participated in two tracks of the challenge: Track 1, which allowed using only the provided training data, and Track 3, which allowed using additional audio data.
arXiv Detail & Related papers (2023-10-26T14:57:08Z)
- Improving Cascaded Unsupervised Speech Translation with Denoising Back-translation
We propose to build a cascaded speech translation system without leveraging any kind of paired data.
We use fully unpaired data to train our unsupervised systems and evaluate our results on CoVoST 2 and CVSS.
arXiv Detail & Related papers (2023-05-12T13:07:51Z)
- A Study on the Integration of Pipeline and E2E SLU systems for Spoken Semantic Parsing toward STOP Quality Challenge
We describe our proposed spoken semantic parsing system for the quality track (Track 1) of the Spoken Language Understanding Grand Challenge.
Strong automatic speech recognition (ASR) models like Whisper and pretrained language models (LMs) like BART are utilized inside our SLU framework to boost performance.
We also investigate the output-level combination of various models, achieving an exact-match accuracy of 80.8, which won first place at the challenge.
arXiv Detail & Related papers (2023-05-02T17:25:19Z)
- End-to-End Active Speaker Detection
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning
In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track.
This paper introduces a new task that includes temporal reasoning, along with our extension of the AVSD dataset for DSTC10.
arXiv Detail & Related papers (2021-10-13T17:24:16Z)
- ESPnet-ST IWSLT 2021 Offline Speech Translation System
This paper describes the ESPnet-ST group's IWSLT 2021 submission in the offline speech translation track.
This year we made various efforts on training data, architecture, and audio segmentation.
Our best E2E system combined all the techniques with model ensembling and achieved 31.4 BLEU.
arXiv Detail & Related papers (2021-07-01T17:49:43Z)
- USTC-NELSLIP System Description for DIHARD-III Challenge
The innovation of our system lies in the combination of various front-end techniques to solve the diarization problem.
Our best system achieved DERs of 11.30% in Track 1 and 16.78% in Track 2 on the evaluation set.
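The diarization error rate (DER) quoted here and elsewhere in this list is the fraction of scored speech time attributed to missed speech, false-alarm speech, or speaker confusion. A minimal sketch of that definition, using hypothetical durations (not any challenge's actual error breakdown):

```python
def der(miss, false_alarm, confusion, total_speech):
    """Diarization Error Rate: the sum of missed-speech, false-alarm,
    and speaker-confusion time over total scored speech time."""
    return (miss + false_alarm + confusion) / total_speech

# Hypothetical durations in seconds, chosen for illustration only.
rate = der(miss=30.0, false_alarm=20.0, confusion=63.0, total_speech=1000.0)
print(f"DER = {rate:.2%}")  # → DER = 11.30%
```

In practice DER is computed by tools such as NIST's md-eval with a time-aligned mapping between reference and hypothesis speaker labels, often with a forgiveness collar around segment boundaries.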
arXiv Detail & Related papers (2021-03-19T07:00:51Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models
We leverage pre-trained language models to improve video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
- DIHARD II is Still Hard: Experimental Results and Discussions from the DKU-LENOVO Team
We present the submitted system for the second DIHARD Speech Diarization Challenge from the DKU-LENOVO team.
Our diarization system includes multiple modules, namely voice activity detection (VAD), segmentation, speaker embedding extraction, similarity scoring, clustering, resegmentation and overlap detection.
Although our systems reduced the DERs by 27.5% and 31.7% relative to the official baselines, we believe that the diarization task is still very difficult.
arXiv Detail & Related papers (2020-02-23T11:50:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.