TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish
Corpus
- URL: http://arxiv.org/abs/2003.09520v2
- Date: Tue, 24 Mar 2020 12:00:28 GMT
- Authors: Elisa Gugliotta, Marco Dinarelli
- Abstract summary: This article describes the constitution process of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC).
Arabish, also known as Arabizi, is a spontaneous coding of Arabic dialects in Latin characters and arithmographs (numbers used as letters).
- Score: 3.8580784887142774
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This article describes the constitution process of the first
morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also
known as Arabizi, is a spontaneous coding of Arabic dialects in Latin
characters and arithmographs (numbers used as letters). This code-system was
developed by Arabic-speaking users of social media to make writing easier in
the informal settings of Computer-Mediated Communication (CMC) and text
messaging. There is variety in the realization of Arabish amongst
dialects, and each Arabish code-system is under-resourced, in the same way as
most of the Arabic dialects. In the last few years, the focus on Arabic
dialects in the NLP field has considerably increased. Taking this into
consideration, TArC will be a useful support for different types of analyses,
computational and linguistic, as well as for NLP tools training. In this
article we will describe preliminary work on the TArC semi-automatic
construction process and some of the first analyses we developed on TArC. In
addition, in order to provide a complete overview of the challenges faced
during the building process, we will present the main Tunisian dialect
characteristics and their encoding in Tunisian Arabish.
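To make the notion of arithmographs concrete, here is a minimal Python sketch that flags digits used as letters inside a Latin-script token. The digit-to-letter pairs are commonly cited Arabizi conventions and are an assumption here, not the encoding scheme documented for TArC; the function name `find_arithmographs` and the sample tokens are hypothetical.

```python
from typing import List, Tuple

# Illustrative table only: commonly cited Arabizi digit-to-letter pairs,
# assumed for this sketch and not taken from the TArC annotation scheme.
ARITHMOGRAPHS = {
    "2": "\u0621",  # hamza
    "3": "\u0639",  # ayn
    "5": "\u062E",  # kha
    "7": "\u062D",  # ha
    "9": "\u0642",  # qaf (a Tunisian convention; other dialects differ)
}


def find_arithmographs(token: str) -> List[Tuple[int, str, str]]:
    """Return (position, digit, assumed Arabic letter) for digits used as letters."""
    # A token made only of digits is treated as an ordinary number.
    if token.isdigit():
        return []
    return [
        (i, ch, ARITHMOGRAPHS[ch])
        for i, ch in enumerate(token)
        if ch in ARITHMOGRAPHS
    ]


if __name__ == "__main__":
    for word in ["3asslema", "sa7a", "2020"]:
        print(word, find_arithmographs(word))
```

A lookup table is the simplest possible design. Because the abstract notes that Arabish realization varies across dialects, a real system would likely need context-sensitive rules or a trained model in place of a fixed mapping.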
Related papers
- Exploiting Dialect Identification in Automatic Dialectal Text Normalization [9.320305816520422]
We aim to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA).
We benchmark newly developed sequence-to-sequence models on the task of CODAfication.
We show that using dialect identification information improves the performance across all dialects.
arXiv Detail & Related papers (2024-07-03T11:30:03Z)
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- On the importance of Data Scale in Pretraining Arabic Language Models [46.431706010614334]
We conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs).
We reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora.
Our analysis strongly suggests that pretraining data by far is the primary contributor to performance, surpassing other factors.
arXiv Detail & Related papers (2024-01-15T15:11:15Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models [57.76998376458017]
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs).
The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts.
We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models.
arXiv Detail & Related papers (2023-08-30T17:07:17Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
- Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z)
- TArC: Tunisian Arabish Corpus First complete release [0.0]
We present the final result of a project on Tunisian Arabic encoded in Arabizi.
The project led to the creation of two integrated and independent resources.
arXiv Detail & Related papers (2022-07-11T11:46:59Z)
- Sentiment Analysis in Poems in Misurata Sub-dialect -- A Sentiment Detection in an Arabic Sub-dialect [0.0]
This study focuses on detecting sentiment in poems written in the Misurata Arabic sub-dialect spoken in Libya.
The tools used to detect sentiment from the dataset are Sklearn as well as the Mazajak sentiment tool.
arXiv Detail & Related papers (2021-09-15T10:42:39Z)
- Automatic Arabic Dialect Identification Systems for Written Texts: A Survey [0.0]
Arabic dialect identification is a specific task of natural language processing, aiming to automatically predict the Arabic dialect of a given text.
In this paper, we present a comprehensive survey of Arabic dialect identification research in written texts.
We review the traditional machine learning methods, deep learning architectures, and complex learning approaches to Arabic dialect identification.
arXiv Detail & Related papers (2020-09-26T15:33:16Z)
- AraDIC: Arabic Document Classification using Image-Based Character Embeddings and Class-Balanced Loss [7.734726150561088]
We propose a novel end-to-end Arabic document classification framework, Arabic document image-based classifier (AraDIC).
AraDIC consists of an image-based character encoder and a classifier. They are trained in an end-to-end fashion using the class balanced loss to deal with the long-tailed data distribution problem.
To the best of our knowledge, this is the first image-based character embedding framework addressing the problem of Arabic text classification.
arXiv Detail & Related papers (2020-06-20T14:25:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.