Transliteration of Judeo-Arabic Texts into Arabic Script Using Recurrent Neural Networks
- URL: http://arxiv.org/abs/2004.11405v2
- Date: Wed, 21 Oct 2020 09:08:53 GMT
- Title: Transliteration of Judeo-Arabic Texts into Arabic Script Using Recurrent Neural Networks
- Authors: Ori Terner, Kfir Bar, Nachum Dershowitz
- Abstract summary: We train a model to automatically transliterate Judeo-Arabic texts into Arabic script.
We employ a recurrent neural network (RNN), combined with the connectionist temporal classification (CTC) loss to deal with unequal input/output lengths.
We obtain an improvement over the baseline character error rate of 9.5%, achieving a 2% error rate with our best configuration.
- Score: 1.933681537640272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We trained a model to automatically transliterate Judeo-Arabic texts into
Arabic script, enabling Arabic readers to access those writings. We employ a
recurrent neural network (RNN), combined with the connectionist temporal
classification (CTC) loss, to deal with unequal input/output lengths. This
necessitates adjusting the training data to avoid input sequences that are
shorter than their corresponding outputs. We also use a pretraining stage
with a different loss function to improve network convergence. Since only a single
source of parallel text was available for training, we take advantage of the
possibility of generating data synthetically. We train a model that is capable
of memorizing words in the output language and that also uses context to
resolve ambiguities in the transliteration. We obtain an improvement over the
baseline character error rate of 9.5%, achieving a 2% error rate with our best
configuration. To measure the contribution of context to learning, we also
tested word-shuffled data, for which the error rises to 2.5%.
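To make the setup concrete, here is a minimal PyTorch sketch of an RNN trained with CTC loss for character-level transliteration, including the filtering of pairs whose inputs are shorter than their outputs that the abstract alludes to. This is an illustration under assumed vocabulary sizes and hyperparameters, not the authors' code.

```python
# Minimal sketch of the approach in the abstract, NOT the authors'
# implementation: a bidirectional LSTM over source characters, trained
# with CTC loss. Vocabulary sizes and dimensions are assumptions.
import torch
import torch.nn as nn

class TransliterationRNN(nn.Module):
    def __init__(self, n_src_chars, n_tgt_chars, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_src_chars, emb)
        self.rnn = nn.LSTM(emb, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        # One extra output class for the CTC blank symbol (index 0).
        self.proj = nn.Linear(2 * hidden, n_tgt_chars + 1)

    def forward(self, src):                      # src: (batch, time)
        h, _ = self.rnn(self.embed(src))
        return self.proj(h).log_softmax(dim=-1)  # (batch, time, classes)

model = TransliterationRNN(n_src_chars=40, n_tgt_chars=45)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def training_step(src, src_lens, tgt, tgt_lens):
    """One CTC training step on a padded batch."""
    # CTC cannot emit more symbols than it has input frames, so drop
    # pairs whose input is shorter than the output -- the training-data
    # adjustment mentioned in the abstract.
    keep = src_lens >= tgt_lens
    src, tgt = src[keep], tgt[keep]
    src_lens, tgt_lens = src_lens[keep], tgt_lens[keep]
    log_probs = model(src).transpose(0, 1)       # CTCLoss wants (T, B, C)
    return ctc_loss(log_probs, tgt, src_lens, tgt_lens)
```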
Related papers
- Prompt-based Learning for Text Readability Assessment [0.4757470449749875]
We propose a novel adaptation of a pre-trained seq2seq model for readability assessment.
We show that a seq2seq model can be adapted to discern which of two given texts is more difficult (pairwise comparison); a toy sketch of this pairwise setup follows below.
arXiv Detail & Related papers (2023-02-25T18:39:59Z)
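As an illustration of that pairwise setup, the sketch below formats two texts into a single prompt for a T5-style seq2seq model from Hugging Face transformers. The prompt template, label words, and checkpoint are my own assumptions, and the model would need fine-tuning on pairwise labels before its outputs mean anything.

```python
# Illustrative sketch only: pairwise readability comparison with a
# T5-style seq2seq model. The prompt template and labels are
# assumptions, not the paper's actual configuration.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def harder_text(text_a: str, text_b: str) -> str:
    """Return the model's verdict ('A' or 'B' after fine-tuning)."""
    prompt = f"Which text is more difficult? A: {text_a} B: {text_b}"
    inputs = tok(prompt, return_tensors="pt", truncation=True)
    out = model.generate(**inputs, max_new_tokens=2)
    return tok.decode(out[0], skip_special_tokens=True)
```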
- Speech-text based multi-modal training with bidirectional attention for improved speech recognition [26.47071418582507]
We propose a novel bidirectional attention mechanism (BiAM) to jointly learn both the ASR encoder (bottom layers) and a text encoder with a multi-modal learning method.
BiAM facilitates feature sampling-rate exchange, so that the quality of features transformed from one modality can be measured in the other modality's space; a toy sketch follows below.
Experimental results on the Librispeech corpus show up to 6.15% word error rate reduction (WERR) with paired data alone, and 9.23% WERR when additional unpaired text data is employed.
arXiv Detail & Related papers (2022-11-01T08:25:11Z)
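The bidirectional-attention idea can be pictured with two cross-attention modules, one per direction. The sketch below is a toy rendering under assumed dimensions; the actual BiAM involves considerably more machinery than this.

```python
# Toy sketch of bidirectional cross-attention between a speech encoder
# and a text encoder, in the spirit of BiAM; dimensions and wiring are
# illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

d = 256
speech_to_text = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
text_to_speech = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

speech = torch.randn(2, 100, d)   # (batch, speech frames, dim)
text = torch.randn(2, 20, d)      # (batch, text tokens, dim)

# Text queries attend over speech frames: a text-rate view of speech.
text_view, _ = text_to_speech(query=text, key=speech, value=speech)
# Speech queries attend over text tokens: a speech-rate view of text.
speech_view, _ = speech_to_text(query=speech, key=text, value=text)

# Each transformed view lives at the other modality's sampling rate,
# so its quality can be measured (e.g., with an L2 loss) in that space.
loss = ((text_view - text) ** 2).mean() + ((speech_view - speech) ** 2).mean()
```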
- Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo-parallel data with translated source sentences, yet is fed natural source sentences at inference.
This source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach, which simultaneously uses pseudo-parallel data of the form {natural source, translated target} to mimic the inference scenario; a schematic of the data construction follows below.
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
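A schematic of the two kinds of pseudo-parallel pairs involved: back-translation yields {translated source, natural target}, while online self-training adds {natural source, translated target}. The translate_* functions here are hypothetical placeholders, not a real API.

```python
# Schematic of the two kinds of pseudo-parallel pairs; translate_* are
# hypothetical stand-ins for the UNMT model's two directions.
def translate_tgt_to_src(tgt: str) -> str:
    return f"<src translation of: {tgt}>"   # placeholder for the model

def translate_src_to_tgt(src: str) -> str:
    return f"<tgt translation of: {src}>"   # placeholder for the model

def build_pairs(natural_src, natural_tgt):
    # Back-translation: {translated source, natural target}.
    bt = [(translate_tgt_to_src(t), t) for t in natural_tgt]
    # Online self-training: {natural source, translated target},
    # so training sees natural source text, just as inference will.
    st = [(s, translate_src_to_tgt(s)) for s in natural_src]
    return bt + st
```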
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and the endangered language Cherokee.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation [98.11249019844281]
Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models.
We propose reverse KD to rejuvenate more alignments for low-frequency target words.
Results demonstrate that the proposed approach can significantly and universally improve translation quality.
arXiv Detail & Related papers (2021-06-02T02:41:40Z)
- Neural String Edit Distance [77.72325513792981]
We propose the neural string edit distance model for string-pair classification and sequence generation.
We turn the original expectation-maximization-learned edit distance algorithm into a differentiable loss function; a generic relaxation in this spirit is sketched below.
We show that we can trade off between performance and interpretability in a single framework.
arXiv Detail & Related papers (2021-04-16T22:16:47Z)
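To make "differentiable edit distance" concrete, here is a generic soft relaxation of the Levenshtein recurrence using a temperature-controlled soft minimum. This is a common relaxation in the same spirit, not the paper's EM-based formulation.

```python
# Generic differentiable relaxation of edit distance via a soft
# minimum; a common trick, NOT the paper's exact EM-based model.
import torch

def soft_min(xs, tau=0.1):
    # Smooth approximation of min() that keeps gradients flowing.
    return -tau * torch.logsumexp(-torch.stack(xs) / tau, dim=0)

def soft_edit_distance(costs_sub, cost_ins=1.0, cost_del=1.0, tau=0.1):
    """costs_sub: (m, n) tensor of substitution costs between the
    m symbols of one string and the n symbols of the other."""
    m, n = costs_sub.shape
    D = [[None] * (n + 1) for _ in range(m + 1)]
    D[0][0] = costs_sub.new_zeros(())
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + cost_del
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + cost_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = soft_min([
                D[i - 1][j] + cost_del,                       # deletion
                D[i][j - 1] + cost_ins,                       # insertion
                D[i - 1][j - 1] + costs_sub[i - 1, j - 1],    # substitution
            ], tau)
    return D[m][n]   # differentiable w.r.t. costs_sub
```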
- Detecting over/under-translation errors for determining adequacy in human translations [0.0]
We present a novel approach to detecting over- and under-translations (OT/UT) as part of adequacy error checks in translation evaluation.
We do not restrict ourselves to machine translation (MT) outputs and specifically target applications with a human-generated translation pipeline.
The goal of our system is to identify OT/UT errors in human-translated video subtitles with high error recall; a deliberately simple length-ratio illustration follows below.
arXiv Detail & Related papers (2021-04-01T06:06:36Z)
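As a deliberately simple illustration of one signal such a detector might use (explicitly not the paper's method), a character-length ratio can flag candidate over/under-translations for review; the thresholds below are arbitrary assumptions.

```python
# A deliberately simple length-ratio heuristic for flagging candidate
# over/under-translations (OT/UT). This is an illustrative baseline,
# NOT the paper's detection method; thresholds are assumptions.
def flag_ot_ut(source: str, translation: str,
               low: float = 0.5, high: float = 2.0) -> str:
    ratio = max(len(translation), 1) / max(len(source), 1)
    if ratio > high:
        return "possible over-translation"
    if ratio < low:
        return "possible under-translation"
    return "ok"

print(flag_ot_ut("a short line", "a very very very long rendering of it"))
```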
- BERT Transformer model for Detecting Arabic GPT2 Auto-Generated Tweets [6.18447297698017]
We propose a transfer-learning based model that detects whether an Arabic sentence was written by a human or automatically generated by a bot.
Our new transfer-learning model achieves accuracy of up to 98%.
To the best of our knowledge, this is the first study in which AraBERT and GPT-2 are combined to detect and classify automatically generated Arabic texts; a minimal classifier sketch follows below.
arXiv Detail & Related papers (2021-01-22T21:50:38Z)
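A minimal sketch of a human-vs-bot Arabic text classifier built on an AraBERT checkpoint via Hugging Face transformers. The checkpoint name and two-label setup are plausible assumptions rather than the paper's exact configuration, and the classification head is untrained until fine-tuned.

```python
# Sketch of a human-vs-bot Arabic text classifier built on AraBERT.
# Checkpoint name and label scheme are plausible assumptions, not
# taken from the paper; fine-tuning on labeled data is still needed.
import torch
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer)

name = "aubmindlab/bert-base-arabertv02"   # an AraBERT checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=2)                    # 0 = human, 1 = bot-generated

def predict(sentence: str) -> str:
    inputs = tok(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return ["human", "bot"][logits.argmax(-1).item()]
```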
- Decoupling Pronunciation and Language for End-to-end Code-switching Automatic Speech Recognition [66.47000813920617]
We propose a decoupled transformer model that uses monolingual paired data and unpaired text data.
The model is decoupled into two parts: an audio-to-phoneme (A2P) network and a phoneme-to-text (P2T) network; a structural sketch follows below.
By using monolingual data and unpaired text data, the decoupled transformer model reduces the high dependency of E2E models on code-switching paired training data.
arXiv Detail & Related papers (2020-10-28T07:46:15Z)
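The decoupled design can be pictured as two independently trainable stages. The sketch below wires hypothetical A2P and P2T modules together to show the interface only; the dimensions and architectures are assumptions, not the paper's transformer design.

```python
# Structural sketch of the decoupled model: an audio-to-phoneme (A2P)
# stage feeding a phoneme-to-text (P2T) stage. Both modules are
# hypothetical stand-ins that show the interface only.
import torch
import torch.nn as nn

class A2P(nn.Module):
    """Audio frames -> phoneme posteriors (trainable on paired audio)."""
    def __init__(self, n_mels=80, n_phones=100, d=256):
        super().__init__()
        self.enc = nn.GRU(n_mels, d, batch_first=True)
        self.out = nn.Linear(d, n_phones)

    def forward(self, audio):            # (batch, frames, n_mels)
        h, _ = self.enc(audio)
        return self.out(h)               # (batch, frames, n_phones)

class P2T(nn.Module):
    """Phoneme posteriors -> token logits (trainable on text-only data)."""
    def __init__(self, n_phones=100, n_tokens=5000, d=256):
        super().__init__()
        self.enc = nn.GRU(n_phones, d, batch_first=True)
        self.out = nn.Linear(d, n_tokens)

    def forward(self, phone_posteriors):
        h, _ = self.enc(phone_posteriors)
        return self.out(h)

a2p, p2t = A2P(), P2T()
audio = torch.randn(2, 300, 80)
text_logits = p2t(a2p(audio).softmax(-1))  # (2, 300, 5000)
```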
- Offline Handwritten Chinese Text Recognition with Convolutional Neural Networks [5.984124397831814]
In this paper, we build models using only convolutional neural networks, with CTC as the loss function.
We achieve a 6.81% character error rate (CER) on the ICDAR 2013 competition set, which is the best published result without language-model correction.
arXiv Detail & Related papers (2020-06-28T14:34:38Z)
- Wake Word Detection with Alignment-Free Lattice-Free MMI [66.12175350462263]
Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input.
We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data.
We evaluate our methods on two real data sets, showing a 50%-90% reduction in false rejection rates at pre-specified false alarm rates over the best previously published figures.
arXiv Detail & Related papers (2020-05-17T19:22:25Z)