Related papers: Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization

Related papers

Dhati+: Fine-tuned Large Language Models for Arabic Subjectivity Evaluation [0.0]
Despite its significance, Arabic faces the challenge of being under-resourced.<n>The scarcity of large annotated datasets hampers the development of accurate tools for subjectivity analysis in Arabic.<n>Recent advances in deep learning and Transformers have proven highly effective for text classification in English and French.
arXiv Detail & Related papers (2025-08-27T15:20:12Z)
Enhanced Arabic Text Retrieval with Attentive Relevance Scoring [12.053940320312355]
Arabic poses a particular challenge for natural language processing and information retrieval.<n>Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources.<n>We present an enhanced Dense Passage Retrieval framework developed specifically for Arabic.
arXiv Detail & Related papers (2025-07-31T10:18:28Z)
Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset [11.204164618338863]
We introduce a new manually diacritized dataset of Arabic proper nouns of various origins with their English Wikipedia equivalent glosses.<n>We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms.<n>Our results underscore both the difficulty of the task and the need for improved models and resources.
arXiv Detail & Related papers (2025-05-05T14:03:22Z)
BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages [93.92804151830744]
We present BRIGHTER -- a collection of multi-labeled datasets in 28 different languages. We describe the data collection and annotation processes and the challenges of building these datasets. We show that BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition.
arXiv Detail & Related papers (2025-02-17T15:39:50Z)
Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models. We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
Decomposed Prompting: Probing Multilingual Linguistic Structure Knowledge in Large Language Models [54.58989938395976]
We introduce a decomposed prompting approach for sequence labeling tasks.<n>We test our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages.
arXiv Detail & Related papers (2024-02-28T15:15:39Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language [0.0]
NArabizi is a Romanized form of North African Arabic used mostly on social media. We introduce an enriched version of NArabizi Treebank with three main contributions.
arXiv Detail & Related papers (2023-06-26T17:27:31Z)
Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmenters in Hadith Domain [5.916745177895035]
In this paper, we present a standard dataset for analyzing the Arabic segmentation tools, which includes approximately 223,690 words from the "Shariat al-Islam" book. To estimate the dataset, we applied different methods, including Farasa, Camel, and ALP, and reported the annotation quality and analyzed the benchmark specifications as well.
arXiv Detail & Related papers (2023-06-22T16:50:40Z)
Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text [4.863310073296471]
We propose 2SDiac, a multi-source model that can effectively support optional diacritics in input to inform all predictions. We also introduce Guided Learning, a training scheme to leverage given diacritics in input with different levels of random masking.
arXiv Detail & Related papers (2023-06-06T10:18:17Z)
Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
Towards Responsible Natural Language Annotation for the Varieties of Arabic [12.526184907781731]
We present a playbook for responsible dataset creation for polyglossic, multidialectal languages. This work is informed by a study on Arabic annotation of social media content.
arXiv Detail & Related papers (2022-03-17T20:23:27Z)
Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language. We generate abstractive summaries of narrated instructional videos across a wide variety of topics. We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
A Multitask Learning Approach for Diacritic Restoration [21.288912928687186]
In many languages like Arabic, diacritics are used to specify pronunciations as well as meanings. Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word. We use Arabic as a case study since it has sufficient data resources for tasks that we consider in our joint modeling.
arXiv Detail & Related papers (2020-06-07T01:20:40Z)
Improving Yor\`ub\'a Diacritic Restoration [3.301896537513352]
Yorub'a is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics. Diacritic marks are commonly excluded from electronic texts due to limited device and application support as well as general education on proper usage. All pre-trained models, datasets and source-code have been released as an open-source project to advance efforts on Yorub'a language technology.
arXiv Detail & Related papers (2020-03-23T22:07:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.