TS-Net: OCR Trained to Switch Between Text Transcription Styles
- URL: http://arxiv.org/abs/2103.05489v1
- Date: Tue, 9 Mar 2021 15:21:40 GMT
- Title: TS-Net: OCR Trained to Switch Between Text Transcription Styles
- Authors: Jan Kohút, Michal Hradiš
- Abstract summary: We propose to extend existing text recognition networks with a Transcription Style Block (TSB)
TSB can learn from data to switch between multiple transcription styles without any explicit knowledge of transcription rules.
We show that TSB is able to learn completely different transcription styles in controlled experiments on artificial data.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Users of OCR systems, from different institutions and scientific disciplines,
prefer and produce different transcription styles. This presents a problem for
training of consistent text recognition neural networks on real-world data. We
propose to extend existing text recognition networks with a Transcription Style
Block (TSB) which can learn from data to switch between multiple transcription
styles without any explicit knowledge of transcription rules. TSB is an
adaptive instance normalization layer conditioned on identifiers representing
consistently transcribed documents (e.g. a single document, documents by a single
transcriber, or an institution). We show that TSB is able to learn completely
different transcription styles in controlled experiments on artificial data, it
improves text recognition accuracy on large-scale real-world data, and it
learns semantically meaningful transcription style embeddings. We also show how
TSB can efficiently adapt to transcription styles of new documents from
transcriptions of only a few text lines.
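The mechanism behind TSB can be sketched in a few lines: instance-normalize each feature channel, then rescale and shift it with parameters looked up by the transcription-style identifier. The sketch below is a minimal illustration under assumed names and shapes, not the paper's implementation; a real system would learn the per-style scale/shift table jointly with the recognition network.

```python
import math

class TranscriptionStyleBlock:
    """Minimal sketch of a TSB-like layer.

    Each channel is normalized to zero mean and unit variance (instance
    normalization), then scaled and shifted with parameters looked up by a
    style identifier -- adaptive instance normalization conditioned on the
    transcription style.
    """

    def __init__(self, num_styles, num_channels, eps=1e-5):
        self.eps = eps
        # One (gamma, beta) pair per channel, per style identifier.
        self.gamma = [[1.0] * num_channels for _ in range(num_styles)]
        self.beta = [[0.0] * num_channels for _ in range(num_styles)]

    def __call__(self, features, style_id):
        # features: list of channels, each a list of values over positions.
        out = []
        for c, channel in enumerate(features):
            mean = sum(channel) / len(channel)
            var = sum((x - mean) ** 2 for x in channel) / len(channel)
            std = math.sqrt(var + self.eps)
            g = self.gamma[style_id][c]
            b = self.beta[style_id][c]
            out.append([g * (x - mean) / std + b for x in channel])
        return out

# Example: two styles share one normalized representation and differ
# only in their learned scale and shift.
tsb = TranscriptionStyleBlock(num_styles=2, num_channels=1)
tsb.beta[1][0] = 0.5  # pretend style 1 learned an offset
normalized = tsb([[1.0, 2.0, 3.0]], style_id=0)
shifted = tsb([[1.0, 2.0, 3.0]], style_id=1)
```

Because the recognizer's features are shared and only the final scale/shift depends on the identifier, adapting to a new document's style reduces to fitting one small parameter vector, which is why a few transcribed lines suffice.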
Related papers
- Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems
This paper presents an overview of a rule-based system for automatic accentuation and phonemic transcription of Russian texts.
The two parts of the developed system, accentuation and transcription, use different approaches to produce correct phonemic representations of input phrases.
The developed toolkit is written in Python and is available on GitHub to any interested researcher.
arXiv Detail & Related papers (2024-10-03T14:43:43Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR
We investigate text generation and injection for improving the performance of a streaming model widely used in industry, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation
End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- StoryTrans: Non-Parallel Story Author-Style Transfer with Discourse Representations and Content Enhancing
Long texts usually involve more complicated author linguistic preferences such as discourse structures than sentences.
We formulate the task of non-parallel story author-style transfer, which requires transferring an input story into a specified author style.
We use an additional training objective to disentangle stylistic features from the learned discourse representation to prevent the model from degenerating to an auto-encoder.
arXiv Detail & Related papers (2022-08-29T08:47:49Z)
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
arXiv Detail & Related papers (2021-09-21T16:01:56Z)
- End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning
We propose a semi-supervised learning method for building end-to-end rich transcription-style automatic speech recognition (RT-ASR) systems.
The key process in our learning method is converting the common transcription-style dataset into a pseudo-rich transcription-style dataset.
Our experiments on spontaneous ASR tasks showed the effectiveness of the proposed method.
arXiv Detail & Related papers (2021-07-07T12:52:49Z)
- Global Rhythm Style Transfer Without Text Transcriptions
Prosody plays an important role in characterizing the style of a speaker or an emotion.
Most non-parallel voice or emotion style transfer algorithms do not convert any prosody information.
We propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions.
arXiv Detail & Related papers (2021-06-16T02:21:00Z)
- Textual Supervision for Visually Grounded Spoken Language Understanding
Visually-grounded models of spoken language understanding extract semantic information directly from speech.
This is useful for low-resource languages, where transcriptions can be expensive or impossible to obtain.
Recent work showed that these models can be improved if transcriptions are available at training time.
arXiv Detail & Related papers (2020-10-06T15:16:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.