Developing an Informal-Formal Persian Corpus
- URL: http://arxiv.org/abs/2308.05336v1
- Date: Thu, 10 Aug 2023 04:57:34 GMT
- Title: Developing an Informal-Formal Persian Corpus
- Authors: Vahide Tajalli, Fateme Kalantari and Mehrnoush Shamsfard
- Abstract summary: We build a parallel corpus of 50,000 sentence pairs with alignments at the word/phrase level.
The resulting corpus has about 530,000 alignments and a dictionary containing 49,397 word and phrase pairs.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Informal language is a style of spoken or written language frequently used in
casual conversations, social media, weblogs, emails and text messages. In
informal writing, a language undergoes lexical and/or syntactic changes that
vary from language to language. Persian is one of the languages with many
differences between its formal and informal styles of writing, so developing
informal language processing tools for it, such as an informal-to-formal
converter, seems necessary. Such a converter needs a large aligned parallel
corpus of colloquial-formal sentences, which can also help linguists extract a
regulated grammar and orthography for colloquial Persian, as has been done for
the formal language. In this paper we explain our methodology for building a
parallel corpus of 50,000 sentence pairs with alignments at the word/phrase
level. The sentences were selected to cover almost all kinds of lexical and
syntactic changes between informal and formal Persian. To find as many
instances as possible, we applied two methods: exploring and collecting from
different resources of informal text, and following the phonological and
morphological patterns of change. The resulting corpus has about 530,000
alignments and a dictionary containing 49,397 word and phrase pairs.
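As a rough illustration of how word/phrase alignments can yield a dictionary of informal-formal pairs, the sketch below stores one aligned sentence pair and counts the aligned spans. The abstract does not describe the corpus file format, so the `AlignedPair` class, the `build_dictionary` function, and the transliterated example tokens are all hypothetical and are not taken from the corpus itself.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple

# A span is a tuple of token indices, so one alignment can link
# a single word or a multi-word phrase on either side.
Span = Tuple[int, ...]


@dataclass(frozen=True)
class AlignedPair:
    """One informal/formal sentence pair with span-level alignments.

    Hypothetical structure: the real corpus format is not given in the abstract.
    """
    informal: Tuple[str, ...]
    formal: Tuple[str, ...]
    alignments: Tuple[Tuple[Span, Span], ...]


def build_dictionary(pairs: Iterable[AlignedPair]) -> Counter:
    """Count each (informal phrase, formal phrase) pair seen in the alignments."""
    counts: Counter = Counter()
    for pair in pairs:
        for informal_span, formal_span in pair.alignments:
            informal_phrase = " ".join(pair.informal[i] for i in informal_span)
            formal_phrase = " ".join(pair.formal[j] for j in formal_span)
            counts[(informal_phrase, formal_phrase)] += 1
    return counts


# Illustrative, transliterated example (invented for this sketch).
example = AlignedPair(
    informal=("nun", "mikham"),
    formal=("nan", "mikhaham"),
    alignments=(((0,), (0,)), ((1,), (1,))),
)

dictionary = build_dictionary([example])
for (informal_phrase, formal_phrase), count in dictionary.items():
    print(informal_phrase, "->", formal_phrase, count)
```

Counting pairs rather than only collecting them keeps frequency information, which could help rank competing formal equivalents for the same informal form.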
Related papers
- Machine Translation to Control Formality Features in the Target Language [0.9208007322096532]
This research explores how machine learning methods are used to translate from English to languages with formality.
It was done by training a bilingual model in a formality-controlled setting and comparing its performance with a pre-trained multilingual model.
We evaluate the official formality accuracy (ACC) by comparing the predicted masked tokens with the ground truth.
arXiv Detail & Related papers (2023-11-22T15:42:51Z)
- In What Languages are Generative Language Models the Most Formal? Analyzing Formality Distribution across Languages [2.457872341625575]
In this work, we focus on one language property highly influenced by culture: formality.
We analyze the formality distributions of XGLM and BLOOM's predictions, two popular generative multilingual language models, in 5 languages.
We classify 1,200 generations per language as formal, informal, or incohesive and measure the impact of the prompt formality on the predictions.
arXiv Detail & Related papers (2023-02-23T19:39:52Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- Computational valency lexica and Homeric formularity [1.6346069386394704]
We present AGVaLex, a lexicon for ancient Greek automatically extracted from the Ancient Greek Dependency Treebank.
It contains quantitative corpus-driven morphological, syntactic and lexical information about verbs and their arguments.
It has a wide range of applications for the study of the language of ancient Greek authors.
arXiv Detail & Related papers (2022-08-23T08:03:16Z)
- MANorm: A Normalization Dictionary for Moroccan Arabic Dialect Written in Latin Script [0.05833117322405446]
We exploit the power of word embedding models generated with a corpus of YouTube comments.
We have built a normalization dictionary that we refer to as MANorm.
arXiv Detail & Related papers (2022-06-18T10:17:46Z)
- AUTOLEX: An Automatic Framework for Linguistic Exploration [93.89709486642666]
We propose an automatic framework that aims to ease linguists' discovery and extraction of concise descriptions of linguistic phenomena.
Specifically, we apply this framework to extract descriptions for three phenomena: morphological agreement, case marking, and word order.
We evaluate the descriptions with the help of language experts and propose a method for automated evaluation when human evaluation is infeasible.
arXiv Detail & Related papers (2022-03-25T20:37:30Z)
- A Novel Corpus of Discourse Structure in Humans and Computers [55.74664144248097]
We present a novel corpus of 445 human- and computer-generated documents, comprising about 27,000 clauses.
The corpus covers both formal and informal discourse, and contains documents generated using fine-tuned GPT-2.
arXiv Detail & Related papers (2021-11-10T20:56:08Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end way.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Validation and Normalization of DCS corpus using Sanskrit Heritage tools to build a tagged Gold Corpus [0.0]
The Digital Corpus of Sanskrit records around 650,000 sentences along with their morphological and lexical tagging.
The Sanskrit Heritage Engine's Reader produces all possible segmentations with morphological and lexical analyses.
arXiv Detail & Related papers (2020-05-13T19:23:43Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)