Automatic Standardization of Colloquial Persian
- URL: http://arxiv.org/abs/2012.05879v1
- Date: Thu, 10 Dec 2020 18:39:26 GMT
- Title: Automatic Standardization of Colloquial Persian
- Authors: Mohammad Sadegh Rasooli, Farzane Bakhtyari, Fatemeh Shafiei, Mahsa
Ravanbakhsh, Chris Callison-Burch
- Abstract summary: Most natural language processing tools for Persian assume that the text is in standard form.
This paper describes a simple and effective standardization approach based on sequence-to-sequence translation.
- Score: 15.192770717442302
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Iranian Persian language has two varieties: standard and
colloquial. Most natural language processing tools for Persian assume that the
text is in standard form; this assumption does not hold in many real
applications, especially for web content. This paper describes a simple and
effective standardization approach based on sequence-to-sequence translation.
We design an algorithm for generating artificial parallel
colloquial-to-standard data for learning a sequence-to-sequence model.
Moreover, we annotate a publicly available evaluation dataset consisting of
1912 sentences from a diverse set of domains. Our intrinsic evaluation shows
that our model reaches a BLEU score of 62.8, versus 61.7 for an off-the-shelf
rule-based standardization model and 46.4 for the original, unmodified text.
We also show that our model improves English-to-Persian machine translation
when the training data comes from colloquial Persian, with an absolute BLEU
gain of 1.4 on the development data and 0.8 on the test data.
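A minimal sketch of the synthetic-data idea described above: rewrite standard
sentences into an artificial colloquial form with hand-written rules, then pair
each rewrite with its original to train a colloquial-to-standard
sequence-to-sequence model. The two rules below are common colloquial Persian
patterns chosen purely for illustration; they are not the paper's actual rule
set.

```python
import re

# Hypothetical standard -> colloquial rewrite rules (illustrative only):
# the /ān/ -> /un/ shift (e.g., "nān" -> "nun", "Tehrān" -> "Tehrun"),
# and a contracted present-tense verb form ("miravam" -> "miram").
RULES = [
    (re.compile("ان"), "ون"),
    (re.compile("می‌روم"), "میرم"),
]

def colloquialize(standard: str) -> str:
    """Apply noisy rewrite rules to synthesize a colloquial variant."""
    out = standard
    for pattern, replacement in RULES:
        out = pattern.sub(replacement, out)
    return out

# Artificial parallel data: source = synthetic colloquial, target = the
# original standard sentence; a seq2seq model then learns the reverse map.
standard_corpus = ["من به تهران می‌روم"]  # "I am going to Tehran"
parallel = [(colloquialize(s), s) for s in standard_corpus]
for src, tgt in parallel:
    print(f"{src}\t{tgt}")
```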
Related papers
- FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts [0.0]
This paper introduces a new transformer-based model to measure semantic similarity between Persian informal short texts from social networks.
It is pre-trained on approximately 104 million Persian informal short texts from social networks, making it unique among Persian language models.
It has been demonstrated that our proposed model outperforms ParsBERT, LaBSE, and multilingual BERT on the Pearson and Spearman correlation criteria; a minimal sketch of this evaluation protocol follows this entry.
arXiv Detail & Related papers (2024-07-27T05:04:49Z)
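A minimal sketch of the correlation-based evaluation mentioned above, assuming
a Hugging Face-style encoder; the checkpoint id and the sentence pairs are
hypothetical placeholders, not FarSSiBERT's actual release.

```python
import torch
from scipy.stats import pearsonr, spearmanr
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "some/persian-informal-bert"  # hypothetical checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def embed(text: str) -> torch.Tensor:
    """Mean-pooled sentence embedding from the encoder's last layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

pairs = [("چطوری؟", "حالت چطوره؟"),       # two ways to say "how are you?"
         ("چطوری؟", "فردا بارون میاد")]   # an unrelated pair
gold = [4.5, 0.5]                          # toy human similarity ratings
pred = [torch.cosine_similarity(embed(a), embed(b), dim=0).item()
        for a, b in pairs]
print(pearsonr(pred, gold), spearmanr(pred, gold))
```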
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundation language models have shown state-of-the-art performance on many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for text classification in domains with limited amounts of annotated instances; a minimal prompt-based sketch follows this entry.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
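A minimal sketch of prompt-based zero-shot classification as described above;
"gpt2" stands in for an instruction-following model, and the labels and prompt
template are hypothetical.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in model

LABELS = ["positive", "negative"]
TEMPLATE = ("Classify the sentiment of the following review as "
            "positive or negative.\nReview: {text}\nSentiment:")

def classify(text: str) -> str:
    prompt = TEMPLATE.format(text=text)
    out = generator(prompt, max_new_tokens=3, do_sample=False)
    completion = out[0]["generated_text"][len(prompt):].lower()
    # Fall back to the first label if the completion names neither.
    return next((label for label in LABELS if label in completion), LABELS[0])

print(classify("The film was a delight from start to finish."))
```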
- Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now in widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- Statistical Machine Translation for Indic Languages [1.8899300124593648]
This paper describes the development of bilingual Statistical Machine Translation models.
The system is built with the open-source MOSES SMT toolkit.
In our experiments, translation quality is evaluated using standard metrics such as BLEU, METEOR, and RIBES; a minimal BLEU sketch follows this entry.
arXiv Detail & Related papers (2023-01-02T06:23:12Z)
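A minimal sketch of BLEU scoring with sacrebleu; METEOR and RIBES require
their own tooling (e.g., nltk.translate.meteor_score for METEOR). The
sentences are toy examples.

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```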
- Consistency Regularization for Cross-Lingual Fine-Tuning [61.08704789561351]
We propose to improve cross-lingual fine-tuning with consistency regularization.
Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations.
Experimental results on the XTREME benchmark show that our method significantly improves cross-lingual fine-tuning across various tasks; a minimal sketch of the consistency loss follows this entry.
arXiv Detail & Related papers (2021-06-15T15:35:44Z)
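A minimal sketch of the example-consistency idea from the entry above:
penalize divergence between the model's predictions on an input and on an
augmented copy of it. The paper uses four augmentation types; here
`augmented_ids` is assumed to come from some such augmentation, and the model
is assumed to be a Hugging Face-style classifier.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, input_ids, augmented_ids):
    """KL divergence between predictions on an input and its augmented copy."""
    with torch.no_grad():                        # fixed target: original input
        target = F.softmax(model(input_ids).logits, dim=-1)
    log_pred = F.log_softmax(model(augmented_ids).logits, dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```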
- Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing [19.19256927651015]
We describe models that convert monolingual English text into Hinglish (code-mixed Hindi and English).
Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models.
Our models place first in the overall ranking of the English-Hinglish official shared task.
arXiv Detail & Related papers (2021-05-18T19:50:25Z)
- Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus [15.807197703827818]
We train LangID models on up to 1,629 languages with comparable quality on held-out test sets.
We find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages.
We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters and transformer-based semi-supervised LangID models; a minimal sketch of a wordlist filter follows this entry.
arXiv Detail & Related papers (2020-10-27T19:29:17Z)
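A minimal sketch of a wordlist-based tunable-precision filter in the spirit of
the entry above: keep a document only if the fraction of its tokens found in a
trusted wordlist for the predicted language clears a threshold; raising the
threshold trades recall for precision. The wordlist and document are toy
examples.

```python
def wordlist_filter(tokens, wordlist, threshold=0.5):
    """Accept a document if enough of its tokens appear in the wordlist."""
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token.lower() in wordlist)
    return hits / len(tokens) >= threshold

swahili_words = {"na", "ya", "kwa", "ni", "wa"}   # toy trusted wordlist
document = "habari ya leo ni njema".split()
print(wordlist_filter(document, swahili_words, threshold=0.4))  # True
```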
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero initial training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make clear use of the global context.
We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)
- Reducing Non-Normative Text Generation from Language Models [7.293053431456775]
Large-scale language models such as GPT-2 are pretrained on diverse corpora scraped from the internet.
We introduce a method for fine-tuning GPT-2 using policy-gradient reinforcement learning and a normative-text classifier; a minimal sketch of the update follows this entry.
arXiv Detail & Related papers (2020-01-23T19:06:18Z)
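A minimal sketch of the policy-gradient loop from the entry above: sample a
continuation from GPT-2, score it with a normative-text classifier, and
reinforce the sample's log-probability in proportion to the reward. The
classifier here is a hypothetical stand-in returning a constant score.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
opt = torch.optim.Adam(lm.parameters(), lr=1e-5)

def normative_score(text: str) -> float:
    """Hypothetical stand-in for the paper's normative-text classifier."""
    return 1.0  # pretend: probability that `text` is normative

prompt = tok("The story begins", return_tensors="pt").input_ids
sample = lm.generate(prompt, do_sample=True, max_new_tokens=20,
                     pad_token_id=tok.eos_token_id)

# Recompute token log-probs with gradients (this includes the prompt
# tokens; a fuller version would mask them out).
log_probs = torch.log_softmax(lm(sample).logits[:, :-1], dim=-1)
token_logp = log_probs.gather(2, sample[:, 1:].unsqueeze(-1)).squeeze(-1)

reward = normative_score(tok.decode(sample[0]))
loss = -(reward * token_logp.sum())   # REINFORCE-style update
loss.backward()
opt.step()
opt.zero_grad()
```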
This list is automatically generated from the titles and abstracts of the papers on this site.