K-Wav2vec 2.0: Automatic Speech Recognition based on Joint Decoding of
Graphemes and Syllables
- URL: http://arxiv.org/abs/2110.05172v1
- Date: Mon, 11 Oct 2021 11:53:12 GMT
- Title: K-Wav2vec 2.0: Automatic Speech Recognition based on Joint Decoding of
Graphemes and Syllables
- Authors: Jounghee Kim, Pilsung Kang
- Abstract summary: K-Wav2Vec 2.0 is a modified version of Wav2vec 2.0 designed for Korean automatic speech recognition.
In fine-tuning, we propose a multi-task hierarchical architecture to reflect the Korean writing structure.
In pre-training, we attempted the cross-lingual transfer of the pre-trained model by further pre-training the English Wav2vec 2.0 on a Korean dataset.
- Score: 2.0813318162800707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Wav2vec 2.0 is an end-to-end framework of self-supervised learning for speech
representation that is successful in automatic speech recognition (ASR), but
most of the work on the topic has been developed with a single language:
English. It is therefore unclear whether the self-supervised framework is
effective in recognizing other languages with different writing systems, such
as Korean, which is written in Hangul, a script with a unique compositional
structure. In this paper,
we present K-Wav2Vec 2.0, which is a modified version of Wav2vec 2.0 designed
for Korean automatic speech recognition by exploring and optimizing various
factors of the original Wav2vec 2.0. In fine-tuning, we propose a multi-task
hierarchical architecture to reflect the Korean writing structure. Moreover, a
joint decoder is applied to alleviate the out-of-vocabulary problem. In
pre-training, we attempted cross-lingual transfer of the
pre-trained model by further pre-training the English Wav2vec 2.0 on a Korean
dataset, considering limited resources. Our experimental results demonstrate
that the proposed method yields the best performance on both Korean ASR
datasets: KsponSpeech (a large-scale Korean speech corpus) and ClovaCall (a
call-based dialog corpus). Further pre-training is also effective in language
adaptation, leading to large improvements without additional data.
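As context for the grapheme/syllable duality that the joint decoder exploits: every precomposed Hangul syllable block is a fixed arithmetic combination of jamo graphemes, so a small grapheme vocabulary covers all 11,172 possible syllables, whereas a syllable-level vocabulary is far larger and more prone to out-of-vocabulary units. The sketch below is not the authors' code; it is a minimal illustration, using standard Unicode Hangul arithmetic and NFC normalization, of how syllables map to and from graphemes (the helper names to_jamo and to_syllables are hypothetical).
```python
# Illustrative sketch only (not from the paper): mapping between Hangul
# syllable blocks (U+AC00..U+D7A3) and their constituent jamo graphemes,
# using the standard Unicode decomposition arithmetic.
import unicodedata

S_BASE = 0xAC00                            # first precomposed syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28     # leads, vowels, tails (0 = no tail)

def to_jamo(text: str) -> list[str]:
    """Decompose precomposed Hangul syllables into conjoining jamo."""
    out = []
    for ch in text:
        idx = ord(ch) - S_BASE
        if 0 <= idx < L_COUNT * V_COUNT * T_COUNT:
            out.append(chr(0x1100 + idx // (V_COUNT * T_COUNT)))             # lead consonant
            out.append(chr(0x1161 + (idx % (V_COUNT * T_COUNT)) // T_COUNT)) # vowel
            if idx % T_COUNT:                                                 # optional tail consonant
                out.append(chr(0x11A7 + idx % T_COUNT))
        else:
            out.append(ch)  # pass non-Hangul characters through unchanged
    return out

def to_syllables(jamo: list[str]) -> str:
    """Recompose conjoining jamo into precomposed syllables via NFC."""
    return unicodedata.normalize("NFC", "".join(jamo))

word = "한국어"             # "Korean (language)", three syllable blocks
graphemes = to_jamo(word)   # the 8 constituent jamo of the 3 syllables
assert to_syllables(graphemes) == word
print(graphemes)
```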
Related papers
- GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement [36.29371629234269]
GigaSpeech 2 is a large-scale, multi-domain, multilingual speech recognition corpus.
It comprises about 30,000 hours of automatically transcribed speech covering Thai, Indonesian, and Vietnamese.
arXiv Detail & Related papers (2024-06-17T13:44:20Z)
- AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation [58.72068260933836]
The input and output of the system are multimodal (i.e., audio and visual speech).
This enables lifelike conversations with individuals worldwide in virtual meetings, with each party using their own primary language.
In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech.
arXiv Detail & Related papers (2023-12-05T05:36:44Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In settings with limited training data, finetuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Zero Resource Code-switched Speech Benchmark Using Speech Utterance Pairs For Multiple Spoken Languages [49.6922490267701]
We introduce a new zero resource code-switched speech benchmark designed to assess the code-switching capabilities of self-supervised speech encoders.
We showcase a baseline system of language modeling on discrete units to demonstrate how the code-switching abilities of speech encoders can be assessed.
arXiv Detail & Related papers (2023-10-04T17:58:11Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0 [7.378368959253632]
We show that a monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages.
A key finding of this work is that the adapted monolingual wav2vec-2.0 achieves similar performance as the topline multilingual XLSR model.
arXiv Detail & Related papers (2021-10-07T15:29:22Z)
- Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages [16.001329145018687]
In the speech domain, wav2vec2.0 has begun to show its powerful representation ability and the feasibility of ultra-low-resource speech recognition on the LibriSpeech corpus.
However, wav2vec2.0 has not been examined in real spoken scenarios or on languages other than English.
We apply pre-trained models to solve low-resource speech recognition tasks in various spoken languages.
arXiv Detail & Related papers (2020-12-22T15:59:44Z)
- Exploring wav2vec 2.0 on speaker verification and language identification [9.047596226273495]
Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning.
In this work, we attempt to extend wav2vec 2.0 to speaker verification and language identification.
For speaker verification, we obtain a new state-of-the-art result, Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset.
For language identification, we obtain an EER of 12.02% under the 1-second condition and an EER of 3.47% under the full-length condition on the AP17-OLR dataset.
arXiv Detail & Related papers (2020-12-11T08:22:23Z)
- KoSpeech: Open-Source Toolkit for End-to-End Korean Speech Recognition [1.7955614278088239]
KoSpeech is an end-to-end Korean automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch.
We propose preprocessing methods for KsponSpeech corpus and a baseline model for benchmarks.
Our baseline model achieved a 10.31% character error rate (CER) on the KsponSpeech corpus with the acoustic model alone.
arXiv Detail & Related papers (2020-09-07T13:25:36Z)