Multilingual Speech Recognition for Low-Resource Indian Languages using
Multi-Task Conformer
- URL: http://arxiv.org/abs/2109.03969v1
- Date: Sun, 22 Aug 2021 09:32:15 GMT
- Title: Multilingual Speech Recognition for Low-Resource Indian Languages using
Multi-Task Conformer
- Authors: Krishna D N
- Abstract summary: We propose a multi-task learning-based transformer model for low-resource multilingual speech recognition for Indian languages.
We use a phoneme decoder for the phoneme recognition task and a grapheme decoder to predict the grapheme sequence.
Our proposed approach obtains significant improvements over previous approaches.
- Score: 4.594159253008448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have recently become very popular for sequence-to-sequence
applications such as machine translation and speech recognition. In this work,
we propose a multi-task learning-based transformer model for low-resource
multilingual speech recognition for Indian languages. Our proposed model
consists of a conformer [1] encoder and two parallel transformer decoders. We
use a phoneme decoder (PHN-DEC) for the phoneme recognition task and a grapheme
decoder (GRP-DEC) to predict the grapheme sequence. We consider the phoneme
recognition task as an auxiliary task for our multi-task learning framework. We
jointly optimize the network for both phoneme and grapheme recognition tasks
using Joint CTC-Attention [2] training. We use a conditional decoding scheme to
inject the language information into the model before predicting the grapheme
sequence. Our experiments show that our proposed approach obtains a
significant improvement over previous approaches [4]. We also show that our
conformer-based dual-decoder approach outperforms both the transformer-based
dual-decoder approach and the single-decoder approach. Finally, we compare
monolingual ASR models with our proposed multilingual ASR approach.
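The abstract sketches a complete recipe: a shared conformer encoder feeds two parallel transformer decoders (PHN-DEC and GRP-DEC), the network is trained with the joint CTC-Attention objective of [2], roughly L = lambda * L_CTC + (1 - lambda) * L_attention, and a language-ID token conditions the grapheme decoder. The PyTorch sketch below is a minimal illustration of that structure, not the paper's implementation: a vanilla TransformerEncoder stands in for the conformer encoder, the vocabulary sizes and loss weights (lam, alpha) are placeholders, and prepending a language-ID token to the decoder input is one common way to realize the conditional decoding the abstract mentions.

import torch
import torch.nn as nn

class DualDecoderASR(nn.Module):
    """Shared speech encoder with two parallel decoders: PHN-DEC for the
    auxiliary phoneme task, GRP-DEC for graphemes. A vanilla
    TransformerEncoder stands in for the paper's conformer encoder so the
    sketch stays dependency-free; all sizes are illustrative."""

    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_enc=6,
                 n_dec=4, phn_vocab=64, grp_vocab=512):
        super().__init__()
        # Real front-ends use convolutional subsampling; a linear map suffices here.
        self.frontend = nn.Linear(n_mels, d_model)
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, n_enc)
        dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.phn_dec = nn.TransformerDecoder(dec, n_dec)  # layers are deep-copied,
        self.grp_dec = nn.TransformerDecoder(dec, n_dec)  # so the decoders are independent
        self.phn_emb = nn.Embedding(phn_vocab, d_model)
        # The grapheme vocabulary is assumed to include language-ID tokens
        # (e.g. <hi>, <ta>) prepended to grp_in: the conditional-decoding step.
        self.grp_emb = nn.Embedding(grp_vocab, d_model)
        self.phn_out = nn.Linear(d_model, phn_vocab)
        self.grp_out = nn.Linear(d_model, grp_vocab)
        self.ctc_out = nn.Linear(d_model, grp_vocab)  # CTC branch on encoder output

    @staticmethod
    def _causal_mask(size, device):
        # Standard upper-triangular mask: position u may not attend past u.
        return torch.triu(torch.full((size, size), float("-inf"), device=device),
                          diagonal=1)

    def forward(self, feats, phn_in, grp_in):
        # feats: (B, T, n_mels); phn_in, grp_in: (B, U) teacher-forced token ids.
        memory = self.encoder(self.frontend(feats))
        phn_h = self.phn_dec(self.phn_emb(phn_in), memory,
                             tgt_mask=self._causal_mask(phn_in.size(1), feats.device))
        grp_h = self.grp_dec(self.grp_emb(grp_in), memory,
                             tgt_mask=self._causal_mask(grp_in.size(1), feats.device))
        return self.ctc_out(memory), self.phn_out(phn_h), self.grp_out(grp_h)

def joint_loss(ctc_logits, phn_logits, grp_logits, phn_tgt, grp_tgt,
               feat_lens, grp_lens, lam=0.3, alpha=0.3, blank=0, pad=-100):
    """Joint CTC-Attention objective [2] with the phoneme task as an
    auxiliary term; lam and alpha are illustrative, not the paper's values.
    For simplicity the same grapheme targets feed both CTC and the
    cross-entropy term; real systems strip special tokens for CTC."""
    ctc = nn.CTCLoss(blank=blank, zero_infinity=True)(
        ctc_logits.log_softmax(-1).transpose(0, 1),   # CTC expects (T, B, C)
        grp_tgt.clamp(min=0), feat_lens, grp_lens)    # padding beyond grp_lens is ignored
    ce = nn.CrossEntropyLoss(ignore_index=pad)
    att = (ce(grp_logits.flatten(0, 1), grp_tgt.flatten())
           + alpha * ce(phn_logits.flatten(0, 1), phn_tgt.flatten()))
    return lam * ctc + (1.0 - lam) * att

At inference time, conditional decoding in this sketch amounts to seeding GRP-DEC with the utterance's language-ID token and decoding graphemes autoregressively; PHN-DEC and the CTC branch serve only as training-time auxiliaries here.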
Related papers
- Cascaded Cross-Modal Transformer for Audio-Textual Classification [30.643750999989233]
We propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models.
We thus obtain an audio-textual (multimodal) representation for each data sample.
Our solution was declared the winner of the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge.
arXiv Detail & Related papers (2024-01-15T10:18:08Z) - Leveraging Timestamp Information for Serialized Joint Streaming
Recognition and Translation [51.399695200838586]
We propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder.
Experiments on it, es, de -> en demonstrate the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
arXiv Detail & Related papers (2023-10-23T11:00:27Z) - VioLA: Unified Codec Language Models for Speech Recognition, Synthesis,
and Translation [91.39949385661379]
VioLA is a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text.
We first convert all the speech utterances to discrete tokens using an offline neural encoder.
We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance the modeling capability of handling different languages and tasks.
arXiv Detail & Related papers (2023-05-25T14:39:47Z) - Online Gesture Recognition using Transformer and Natural Language
Processing [0.0]
Transformer architecture is shown to provide a powerful machine framework for online gestures corresponding to glyph strokes of natural language sentences.
arXiv Detail & Related papers (2023-05-05T10:17:22Z) - LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and
Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z) - Scaling Up Deliberation for Multilingual ASR [36.860327600638705]
We investigate second-pass deliberation for multilingual speech recognition.
Our proposed deliberation is multilingual, i.e., the text encoder encodes hypothesis text from multiple languages, and the decoder attends to multilingual text and audio.
We show that deliberation improves the average WER on 9 languages by 4% relative compared to the single-pass model.
arXiv Detail & Related papers (2022-10-11T21:07:00Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - A Dual-Decoder Conformer for Multilingual Speech Recognition [4.594159253008448]
This work proposes a dual-decoder transformer model for low-resource multilingual speech recognition for Indian languages.
We use a phoneme decoder (PHN-DEC) for the phoneme recognition task and a grapheme decoder (GRP-DEC) to predict the grapheme sequence along with language information.
Our experiments show that we can obtain a significant reduction in WER over the baseline approaches.
arXiv Detail & Related papers (2021-08-22T09:22:28Z) - Dual-decoder Transformer for Joint Automatic Speech Recognition and
Multilingual Speech Translation [71.54816893482457]
We introduce the dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST).
Our models are based on the original Transformer architecture but consist of two decoders, each responsible for one task (ASR or ST).
arXiv Detail & Related papers (2020-11-02T04:59:50Z) - Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)