Related papers: Neural Coreference Resolution for Arabic

Neural Coreference Resolution for Arabic

URL: http://arxiv.org/abs/2011.00286v1
Date: Sat, 31 Oct 2020 14:34:43 GMT
Title: Neural Coreference Resolution for Arabic
Authors: Abdulrahman Aloraini, Juntao Yu and Massimo Poesio
Abstract summary: We introduce a coreference resolution system for Arabic based on Lee et al's end to end architecture combined with the Arabic version of bert and an external mention detector. As far as we know, this is the first neural coreference resolution system aimed specifically to Arabic. It substantially outperforms the existing state of the art on OntoNotes 5.0 with a gain of 15.2 points conll F1.
Score: 12.986359659930146
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: No neural coreference resolver for Arabic exists, in fact we are not aware of any learning-based coreference resolver for Arabic since (Bjorkelund and Kuhn, 2014). In this paper, we introduce a coreference resolution system for Arabic based on Lee et al's end to end architecture combined with the Arabic version of bert and an external mention detector. As far as we know, this is the first neural coreference resolution system aimed specifically to Arabic, and it substantially outperforms the existing state of the art on OntoNotes 5.0 with a gain of 15.2 points conll F1. We also discuss the current limitations of the task for Arabic and possible approaches that can tackle these challenges.

Related papers

DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects.<n>We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z)
Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale [51.41777906371754]
We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline.<n>A lightweight language model LFM2-1.2B is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic.<n>We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths.
arXiv Detail & Related papers (2025-09-17T14:19:28Z)
Enhanced Arabic Text Retrieval with Attentive Relevance Scoring [12.053940320312355]
Arabic poses a particular challenge for natural language processing and information retrieval.<n>Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources.<n>We present an enhanced Dense Passage Retrieval framework developed specifically for Arabic.
arXiv Detail & Related papers (2025-07-31T10:18:28Z)
Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLM) in the Arab world. One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding. Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language. Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix. By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
arXiv Detail & Related papers (2024-07-13T21:09:38Z)
Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA) We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
ArabicaQA: A Comprehensive Dataset for Arabic Question Answering [13.65056111661002]
We introduce ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus.
arXiv Detail & Related papers (2024-03-26T16:37:54Z)
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present datasetname, the first multi-task language understanding benchmark for the Arabic language. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region. Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
Arabic Sentiment Analysis with Noisy Deep Explainable Model [48.22321420680046]
This paper proposes an explainable sentiment classification framework for the Arabic language. The proposed framework can explain specific predictions by training a local surrogate explainable model. We carried out experiments on public benchmark Arabic SA datasets.
arXiv Detail & Related papers (2023-09-24T19:26:53Z)
AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models [57.76998376458017]
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs) The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts. We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models.
arXiv Detail & Related papers (2023-08-30T17:07:17Z)
A Survey on Arabic Named Entity Recognition: Past, Recent Advances, and Future Trends [15.302538985992518]
We provide a comprehensive review of the development of Arabic NER. Traditional Arabic NER systems focus on feature engineering and designing domain-specific rules. With the growth of pre-trained language model, Arabic NER yields better performance.
arXiv Detail & Related papers (2023-02-07T14:56:52Z)
ORCA: A Challenging Benchmark for Arabic Language Understanding [8.9379057739817]
ORCA is a publicly available benchmark for Arabic language understanding evaluation. To measure current progress in Arabic NLU, we use ORCA to offer a comprehensive comparison between 18 multilingual and Arabic language models.
arXiv Detail & Related papers (2022-12-21T04:35:43Z)
TCE at Qur'an QA 2022: Arabic Language Question Answering Over Holy Qur'an Using a Post-Processed Ensemble of BERT-based Models [0.0]
Arabic is the language of the Holy Qur'an; the sacred text for 1.8 billion people across the world. We propose an ensemble learning model based on Arabic variants of BERT models. Our system achieves a Partial Reciprocal Rank (pRR) score of 56.6% on the official test set.
arXiv Detail & Related papers (2022-06-03T13:00:48Z)
TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus [3.8580784887142774]
This article describes the constitution process of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC) Arabish, also known as Arabizi, is a spontaneous coding of Arabic dialects in Latin characters and arithmographs (numbers used as letters)
arXiv Detail & Related papers (2020-03-20T22:29:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.