Transsion TSUP's speech recognition system for ASRU 2023 MADASR Challenge
- URL: http://arxiv.org/abs/2307.11778v1
- Date: Thu, 20 Jul 2023 00:55:01 GMT
- Title: Transsion TSUP's speech recognition system for ASRU 2023 MADASR Challenge
- Authors: Xiaoxiao Li, Gaosheng Zhang, An Zhu, Weiyong Li, Shuming Fang, Xiaoyue Yang, Jianchao Zhu
- Abstract summary: The system focuses on adapting ASR models for low-resource Indian languages.
The proposed method achieved word error rates (WER) of 24.17%, 24.43%, 15.97%, and 15.97% for the Bengali language across the four tracks, and WERs of 19.61%, 19.54%, 15.48%, and 15.48% for the Bhojpuri language.
- Score: 11.263392524468625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a speech recognition system developed by the Transsion
Speech Understanding Processing Team (TSUP) for the ASRU 2023 MADASR Challenge.
The system focuses on adapting ASR models for low-resource Indian languages and
covers all four tracks of the challenge. For tracks 1 and 2, the acoustic model
utilized a Squeezeformer encoder and a bidirectional Transformer decoder with
joint CTC-Attention training loss. Additionally, an external KenLM language
model was used during TLG beam search decoding. For tracks 3 and 4, pretrained
IndicWhisper models were employed and finetuned on both the challenge dataset
and publicly available datasets. The Whisper beam search decoding was also
modified to support an external KenLM language model, which enabled better
utilization of the additional text provided by the challenge. The proposed
method achieved word error rates (WER) of 24.17%, 24.43%, 15.97%, and 15.97%
for the Bengali language across the four tracks, and WERs of 19.61%, 19.54%,
15.48%, and 15.48% for the Bhojpuri language. These results demonstrate the
effectiveness of the proposed method.
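For a concrete picture of the tracks 1 and 2 training objective, the sketch below shows one common way to interpolate a CTC loss and an attention (cross-entropy) loss in PyTorch. It is a minimal illustration, not the authors' code: the tensor layouts, the 0.3 interpolation weight, and the padding convention are assumptions the paper does not report.

import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, ctc_targets, input_lengths,
                             target_lengths, att_logits, att_targets,
                             ctc_weight=0.3, pad_id=-100):
    # ctc_log_probs: (T, N, V) log-softmax output of the encoder's CTC head
    # att_logits:    (N, L, V) logits from the attention decoder
    # ctc_weight:    assumed interpolation weight; the paper does not report it
    ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lengths,
                     target_lengths, blank=0, zero_infinity=True)
    att = F.cross_entropy(att_logits.transpose(1, 2), att_targets,
                          ignore_index=pad_id)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att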
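Similarly, folding an external KenLM model into Whisper beam search, as the tracks 3 and 4 systems do, can be pictured as shallow fusion at hypothesis-scoring time. The snippet below is a sketch under stated assumptions: the model path, word-level tokenization, and the fusion weight lm_weight are hypothetical, while kenlm.Model and its score() method are the standard kenlm Python bindings.

import kenlm

lm = kenlm.Model("bn_4gram.arpa")  # hypothetical KenLM model built from the challenge text

def fused_score(am_log_prob, hyp_words, lm_weight=0.5):
    # am_log_prob: cumulative log-probability of the hypothesis under the
    # ASR model (e.g., summed Whisper token log-probs).
    # kenlm.score() returns a log10 probability; the base mismatch is
    # conventionally absorbed into the tuned fusion weight. The
    # end-of-sentence term is omitted while the hypothesis is still growing.
    lm_log10 = lm.score(" ".join(hyp_words), bos=True, eos=False)
    return am_log_prob + lm_weight * lm_log10

At each beam expansion, candidate hypotheses would be reranked with fused_score before pruning to the beam width, which is how additional text data can steer decoding without retraining the acoustic model.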
Related papers
- OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [88.59397418187226]
We propose a novel unified open-vocabulary detection method called OV-DINO.
It is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework.
We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks.
arXiv Detail & Related papers (2024-07-10T17:05:49Z)
- Automatic Speech Recognition Advancements for Indigenous Languages of the Americas [0.0]
The Second AmericasNLP (Americas Natural Language Processing) Competition, Track 1 of NeurIPS (Neural Information Processing Systems) 2022, proposed the task of training automatic speech recognition systems for five Indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana.
We describe the fine-tuning of a state-of-the-art ASR model for each target language, using approximately 36.65 h of transcribed speech data from diverse sources enriched with data augmentation methods.
We release our best models for each language, marking the first open ASR models for Wa'ikhana and Kotiria.
arXiv Detail & Related papers (2024-04-12T10:12:38Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Mavericks at NADI 2023 Shared Task: Unravelling Regional Nuances through Dialect Identification using Transformer-based Approach [0.0]
We highlight our methodology for subtask 1 which deals with country-level dialect identification.
The task uses the Twitter dataset (TWT-2023) that encompasses 18 dialects for the multi-class classification problem.
We achieved an F1-score of 76.65 (11th rank on the leaderboard) on the test dataset.
arXiv Detail & Related papers (2023-11-30T17:37:56Z)
- Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech Systems for the MADASR 2023 Challenge [2.018088271426157]
This paper describes Tallinn University of Technology (TalTech) systems developed for the ASRU MADASR 2023 Challenge.
The challenge focuses on automatic speech recognition of dialect-rich Indian languages with limited training audio and text data.
TalTech participated in two tracks of the challenge: Track 1, which allowed using only the provided training data, and Track 3, which allowed using additional audio data.
arXiv Detail & Related papers (2023-10-26T14:57:08Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- An Automatic Speech Recognition System for Bengali Language based on Wav2Vec2 and Transfer Learning [0.0]
This paper aims to improve Bengali speech recognition by adopting an end-to-end (E2E) architecture within a transfer learning framework.
The proposed method effectively models the Bengali language and achieves a Levenshtein Mean Distance score of 3.819 on a test dataset of 7,747 samples when only 1,000 samples of the training dataset were used for training.
arXiv Detail & Related papers (2022-09-16T18:20:16Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggested incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
- Pretraining Approaches for Spoken Language Recognition: TalTech Submission to the OLR 2021 Challenge [0.0]
The paper is based on our submission to the Oriental Language Recognition 2021 Challenge.
For the constrained track, we first trained a Conformer-based encoder-decoder model for multilingual automatic speech recognition.
For the unconstrained task, we relied on both externally available pretrained models as well as external data.
arXiv Detail & Related papers (2022-05-14T15:17:08Z)
- CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR).
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z)