WikiTermBase: An AI-Augmented Term Base to Standardize Arabic Translation on Wikipedia
- URL: http://arxiv.org/abs/2505.20369v1
- Date: Mon, 26 May 2025 11:27:01 GMT
- Title: WikiTermBase: An AI-Augmented Term Base to Standardize Arabic Translation on Wikipedia
- Authors: Michel Bakni, Abbad Diraneyya, Wael Tellat
- Abstract summary: This abstract introduces an open source tool, WikiTermBase, with a systematic approach for building a lexicographical database with over 900K terms. The tool was successfully implemented on Arabic Wikipedia to standardize translated English and French terms.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Term bases are recognized as one of the most effective components of translation software at saving time and ensuring consistency. In spite of the many recent advances in natural language processing (NLP) and large language models (LLMs), major translation platforms have yet to take advantage of these tools to improve their term bases and support scalable content for underrepresented languages, which often struggle with localizing technical terminology. Language academies in the Arab World, for example, have struggled since the 1940s to unify the way new scientific terms enter the Arabic language at scale. This abstract introduces an open source tool, WikiTermBase, with a systematic approach for building a lexicographical database with over 900K terms, which were collected and mapped from a multitude of sources on a semantic and morphological basis. The tool was successfully implemented on Arabic Wikipedia to standardize translated English and French terms.
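The core idea of a term base, mapping source-language terms to a single standardized translation, can be illustrated with a minimal sketch. All class names, fields, and sample entries below are hypothetical; the actual WikiTermBase schema and its morphological mapping are far richer than this toy lookup.

```python
# Illustrative sketch only: a minimal term-base lookup in the spirit of
# WikiTermBase. Names and sample entries are hypothetical.
from dataclasses import dataclass, field

@dataclass
class TermEntry:
    term: str                                    # source-language term (e.g. English)
    translation: str                             # standardized Arabic equivalent
    sources: list = field(default_factory=list)  # dictionaries that attest the mapping

class TermBase:
    def __init__(self):
        self._entries = {}

    @staticmethod
    def _normalize(term):
        # Real morphological normalization is much richer;
        # here we only lowercase and strip whitespace.
        return term.strip().lower()

    def add(self, entry):
        self._entries[self._normalize(entry.term)] = entry

    def lookup(self, term):
        return self._entries.get(self._normalize(term))

tb = TermBase()
tb.add(TermEntry("Algorithm", "خوارزمية", sources=["sample-dictionary"]))
hit = tb.lookup("  algorithm ")
print(hit.translation if hit else "not found")  # → خوارزمية
```

Keying the store on a normalized form is what lets every editor who queries the same English term receive the same standardized Arabic rendering, which is the consistency property the abstract highlights.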
Related papers
- Advancing Arabic Reverse Dictionary Systems: A Transformer-Based Approach with Dataset Construction Guidelines [0.8944616102795021]
This study addresses the critical gap in Arabic natural language processing by developing an effective Arabic Reverse Dictionary (RD) system. We present a novel transformer-based approach with a semi-encoder neural network architecture featuring geometrically decreasing layers. Our methodology incorporates a comprehensive dataset construction process and establishes formal quality standards for Arabic lexicographic definitions.
arXiv Detail & Related papers (2025-04-30T09:56:36Z)
- Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset (GIST) [19.91873751674613]
GIST is a large-scale multilingual AI terminology dataset containing 5K terms extracted from top AI conference papers spanning 2000 to 2023. The terms are translated into Arabic, Chinese, French, Japanese, and Russian using a hybrid framework that combines LLMs for extraction with human expertise for translation. The dataset's quality is benchmarked against existing resources, demonstrating superior translation accuracy through crowdsourced evaluation.
arXiv Detail & Related papers (2024-12-24T11:50:18Z)
- Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLMs) in the Arab world. One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding. Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
- LEVOS: Leveraging Vocabulary Overlap with Sanskrit to Generate Technical Lexicons in Indian Languages [39.08623113730563]
We propose a novel use-case of Sanskrit-based segments for linguistically informed translation of technical terms. Our approach uses character-level segmentation to identify meaningful subword units. We observe consistent improvements in two experimental settings for technical term translation using Sanskrit-derived segments.
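The character-level segmentation step can be sketched in miniature. The function below is only a naive stand-in, assuming longest-common-substring matching over Latin transliterations; the paper's actual segmentation method and data are not reproduced here, and the example words are illustrative.

```python
# Hedged sketch: find character-level segments shared between two terms,
# loosely inspired by the LEVOS idea of Sanskrit-derived subword units.
# This naive longest-common-substring search is NOT the paper's method.
def common_segments(a, b, min_len=3):
    """Return maximal substrings of `a` (length >= min_len) that also occur in `b`."""
    found = set()
    for i in range(len(a)):
        for j in range(i + min_len, len(a) + 1):
            if a[i:j] in b:
                found.add(a[i:j])
    # Keep only maximal segments, i.e. those not contained in a longer match.
    return sorted(s for s in found
                  if not any(s != t and s in t for t in found))

print(common_segments("yantra", "yantrika"))  # → ['yantr']
```

Shared segments like these are the raw material from which a linguistically informed system could propose technical-term translations across related vocabularies.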
arXiv Detail & Related papers (2024-07-08T18:50:13Z)
- CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Annotation Backend, an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models.
CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z)
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
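A toy version of romanization can be shown as a plain character mapping. This is only an illustrative sketch: real FST-based systems like the one the paper describes handle context, diacritics, and ambiguity, none of which this table-lookup captures, and the mapping below covers just a handful of Arabic letters chosen for the example.

```python
# Minimal, illustrative character-mapping romanization for a few Arabic
# letters. A real FST-based transliterator is context-sensitive; this
# toy table is not.
ROMANIZATION = {
    "ا": "a", "ب": "b", "ت": "t", "س": "s", "ل": "l", "م": "m", "ك": "k",
}

def romanize(text, table=ROMANIZATION):
    # Characters without a mapping pass through unchanged.
    return "".join(table.get(ch, ch) for ch in text)

print(romanize("سلام"))  # → slam
```

Even this toy shows why finite-state machinery is preferred in practice: a per-character table cannot express rules that depend on neighboring letters or vowel marks.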
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
- Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z)
- A Comprehensive Understanding of Code-mixed Language Semantics using Hierarchical Transformer [28.3684494647968]
We propose a hierarchical transformer-based architecture (HIT) to learn the semantics of code-mixed languages.
We evaluate the proposed method across 6 Indian languages and 9 NLP tasks on 17 datasets.
arXiv Detail & Related papers (2022-04-27T07:50:18Z)
- AtteSTNet -- An attention and subword tokenization based approach for code-switched text hate speech detection [0.8287206589886882]
Language used in social media is often a combination of English and the native language in the region. In India, Hindi is used predominantly and is often code-switched with English, giving rise to the Hinglish (Hindi+English) language.
arXiv Detail & Related papers (2021-12-10T20:01:44Z)
- Design Challenges in Low-resource Cross-lingual Entity Linking [56.18957576362098]
Cross-lingual Entity Linking (XEL) is the problem of grounding mentions of entities in a foreign language text into an English knowledge base such as Wikipedia.
This paper focuses on the key step of identifying candidate English Wikipedia titles that correspond to a given foreign language mention.
We present a simple yet effective zero-shot XEL system, QuEL, that utilizes search-engine query logs.
arXiv Detail & Related papers (2020-05-02T04:00:26Z)
- LSCP: Enhanced Large Scale Colloquial Persian Language Understanding [2.7249643773851724]
The "Large Scale Colloquial Persian dataset" aims to capture the colloquial language of a low-resourced language.
The proposed corpus consists of 120M sentences resulting from 27M tweets annotated with parsing trees, part-of-speech tags, sentiment polarity, and translations into five different languages.
arXiv Detail & Related papers (2020-03-13T22:24:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.