End-to-End Rich Transcription-Style Automatic Speech Recognition with
Semi-Supervised Learning
- URL: http://arxiv.org/abs/2107.05382v1
- Date: Wed, 7 Jul 2021 12:52:49 GMT
- Title: End-to-End Rich Transcription-Style Automatic Speech Recognition with
Semi-Supervised Learning
- Authors: Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota
Orihashi, Naoki Makishima
- Abstract summary: We propose a semi-supervised learning method for building end-to-end rich transcription-style automatic speech recognition (RT-ASR) systems.
The key process in our learning is to convert the common transcription-style dataset into a pseudo-rich transcription-style dataset.
Our experiments on spontaneous ASR tasks showed the effectiveness of the proposed method.
- Score: 28.516240952627076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a semi-supervised learning method for building end-to-end rich
transcription-style automatic speech recognition (RT-ASR) systems from
small-scale rich transcription-style and large-scale common transcription-style
datasets. In spontaneous speech tasks, various speech phenomena such as
fillers, word fragments, laughter, and coughs are often included. While
common transcriptions do not give special awareness to these phenomena, rich
transcriptions explicitly convert them into special phenomenon tokens as well
as textual tokens. In previous studies, the textual and phenomenon tokens were
simultaneously estimated in an end-to-end manner. However, it is difficult to
build accurate RT-ASR systems because large-scale rich transcription-style
datasets are often unavailable. To solve this problem, our training method uses
a limited rich transcription-style dataset and common transcription-style
dataset simultaneously. The key process in our semi-supervised learning is to
convert the common transcription-style dataset into a pseudo-rich
transcription-style dataset. To this end, we introduce style tokens, which
control whether phenomenon tokens are generated or not, into transformer-based
autoregressive modeling. We use this modeling both for generating the
pseudo-rich transcription-style datasets and for building the RT-ASR system
from the pseudo and
original datasets. Our experiments on spontaneous ASR tasks showed the
effectiveness of the proposed method.
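To make the style-token idea concrete, below is a minimal PyTorch sketch of a transformer-based autoregressive decoder whose first target token is a style token, and of greedy decoding with the "rich" style token to produce pseudo-rich-style labels for a common-style utterance. The class names, token ids, vocabulary layout, and decoding loop are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

# Hypothetical vocabulary layout: ordinary textual tokens plus special
# phenomenon tokens (e.g. <filler>, <laugh>, <cough>) and two style tokens
# that prefix every target sequence. All ids and sizes are placeholders.
STYLE_COMMON, STYLE_RICH = 0, 1
VOCAB_SIZE, D_MODEL = 1000, 256


class StyleConditionedDecoder(nn.Module):
    """Transformer-based autoregressive decoder whose first target token is a
    style token controlling whether phenomenon tokens should be generated."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens, encoder_out):
        # Causal mask so each position only attends to earlier tokens.
        seq_len = tokens.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(tokens), encoder_out, tgt_mask=mask)
        return self.out(h)


@torch.no_grad()
def pseudo_rich_transcript(model, encoder_out, max_len=50):
    """Greedy decoding started from the RICH style token: a common-style
    utterance is re-labelled as a pseudo-rich-style transcript that may
    contain phenomenon tokens alongside textual tokens."""
    tokens = torch.tensor([[STYLE_RICH]])
    for _ in range(max_len):
        logits = model(tokens, encoder_out)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens


# Stand-in for speech encoder states of one common-style utterance.
encoder_out = torch.randn(1, 20, D_MODEL)
model = StyleConditionedDecoder()
print(pseudo_rich_transcript(model, encoder_out).shape)  # (1, max_len + 1)

The RT-ASR system would then be trained on the union of the original rich transcription-style data and the pseudo-rich-style data generated this way.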
Related papers
- Alignment-Free Training for Transducer-based Multi-Talker ASR [55.1234384771616]
Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation.
We propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture.
arXiv Detail & Related papers (2024-09-30T13:58:11Z)
- General Detection-based Text Line Recognition [15.761142324480165]
We introduce a general detection-based approach to text line recognition, be it printed (OCR) or handwritten (HTR).
Our approach builds on a completely different paradigm than state-of-the-art HTR methods, which rely on autoregressive decoding.
We improve state-of-the-art performances for Chinese script recognition on the CASIA v2 dataset, and for cipher recognition on the Borg and Copiale datasets.
arXiv Detail & Related papers (2024-09-25T17:05:55Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition [48.84506301960988]
Cued Speech (CS) is a pure visual coding method used by hearing-impaired people.
Automatic CS recognition (ACSR) seeks to transcribe visual cues of speech into text.
arXiv Detail & Related papers (2024-01-31T05:20:29Z)
- Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation [51.399695200838586]
We propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder.
Experiments on it, es, de -> en prove the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
arXiv Detail & Related papers (2023-10-23T11:00:27Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Improving Data Driven Inverse Text Normalization using Data Augmentation [14.820077884045645]
Inverse text normalization (ITN) is used to convert the spoken form output of an automatic speech recognition (ASR) system to a written form.
We present a data augmentation technique that effectively generates rich spoken-written numeric pairs from out-of-domain textual data.
We empirically demonstrate that an ITN model trained using our data augmentation technique consistently outperforms an ITN model trained using only in-domain data.
arXiv Detail & Related papers (2022-07-20T06:07:26Z)
- Context-Aware Transformer Transducer for Speech Recognition [21.916660252023707]
We present a novel context-aware transformer transducer (CATT) network that improves the state-of-the-art transformer-based ASR system by taking advantage of such contextual signals.
We show that CATT, using a BERT-based context encoder, reduces the word error rate of the baseline transformer transducer by 24.2% and outperforms an existing deep contextual model by 19.4%.
arXiv Detail & Related papers (2021-11-05T04:14:35Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- TS-Net: OCR Trained to Switch Between Text Transcription Styles [0.0]
We propose to extend existing text recognition networks with a Transcription Style Block (TSB).
TSB can learn from data to switch between multiple transcription styles without any explicit knowledge of transcription rules.
We show that TSB is able to learn completely different transcription styles in controlled experiments on artificial data.
arXiv Detail & Related papers (2021-03-09T15:21:40Z)