Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation
- URL: http://arxiv.org/abs/2601.09648v1
- Date: Wed, 14 Jan 2026 17:31:21 GMT
- Title: Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation
- Authors: Andrew Moore, Paul Rayson, Dawn Archer, Tim Czerniak, Dawn Knight, Daisy Lal, Gearóid Ó Donnchadha, Mícheál Ó Meachair, Scott Piao, Elaine Uí Dhonnchadha, Johanna Vuorinen, Yan Yabo, Xiaobin Yang
- Abstract summary: We perform the largest semantic tagging evaluation of the rule-based system using five different languages. We train and evaluate various mono- and multilingual neural models in both mono- and cross-lingual evaluation setups. The resulting neural network models, including the data they were trained on, the Chinese evaluation dataset, and all of the code have been released as open resources.
- Score: 2.570766532236233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Word Sense Disambiguation (WSD) has been widely evaluated using the semantic frameworks of WordNet, BabelNet, and the Oxford Dictionary of English. However, for the UCREL Semantic Analysis System (USAS) framework, no open, extensive evaluation has been performed beyond lexical coverage or single-language evaluation. In this work, we perform the largest semantic tagging evaluation of the rule-based system that uses the lexical resources in the USAS framework, covering five different languages using four existing datasets and one novel Chinese dataset. We create a new silver-labelled English dataset to overcome the lack of manually tagged training data, on which we train and evaluate various mono- and multilingual neural models in both mono- and cross-lingual evaluation setups, with comparisons to their rule-based counterparts, and we show how a rule-based system can be enhanced with a neural network model. The resulting neural network models, including the data they were trained on, the Chinese evaluation dataset, and all of the code have been released as open resources.
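The hybrid design described in the abstract can be sketched as follows: a rule-based lexicon lookup supplies candidate semantic tags, and a neural model steps in only to resolve ambiguity. This is a minimal illustration, not PyMUSAS's actual API; the lexicon, tag candidates, and scoring function are invented stand-ins (only the `Z99` "unmatched" tag follows USAS convention).

```python
# Illustrative sketch of a hybrid rule-based + neural semantic tagger.
# The lexicon and scorer are toy stand-ins, not the real USAS resources.

LEXICON = {
    "bank": ["I1.1", "W3"],   # toy candidates: money sense vs. river-bank sense
    "river": ["W3"],
    "money": ["I1.1"],
}

def neural_score(word, context, tag):
    """Stand-in for a neural disambiguator scoring a candidate tag in context.
    Here it simply counts context words sharing the candidate tag."""
    return sum(1 for w in context if tag in LEXICON.get(w, []))

def hybrid_tag(word, context):
    """Rule-based lookup first; neural scoring only to resolve ambiguity."""
    candidates = LEXICON.get(word, ["Z99"])  # Z99: unmatched, per USAS convention
    if len(candidates) == 1:
        return candidates[0]                 # unambiguous: rules suffice
    return max(candidates, key=lambda t: neural_score(word, context, t))

print(hybrid_tag("bank", ["river"]))   # ambiguous word, water context -> W3
print(hybrid_tag("bank", ["money"]))   # ambiguous word, money context -> I1.1
```

The design point is that the neural component augments rather than replaces the rules: unambiguous lexicon entries and out-of-lexicon fallbacks are handled exactly as in the rule-based system.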
Related papers
- BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages [4.279942349440352]
We present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages.
We construct a large-scale synthetic dataset, BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages.
We analyze how language choice, both in the prompt instructions and document grounding, affects data quality.
arXiv Detail & Related papers (2025-11-13T14:12:44Z)
- Cross-Lingual Word Alignment for ASEAN Languages with Contrastive Learning [5.5119571570277826]
Cross-lingual word alignment plays a crucial role in various natural language processing tasks.
A recent study proposes a BiLSTM-based encoder-decoder model that outperforms pre-trained language models in low-resource settings.
We propose incorporating contrastive learning into the BiLSTM-based encoder-decoder framework.
arXiv Detail & Related papers (2024-07-06T11:56:41Z) - XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z) - Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z) - Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z) - Graph Neural Network Enhanced Language Models for Efficient Multilingual
Text Classification [8.147244878591014]
We propose a multilingual disaster-related text classification system capable of working in mono-, cross-, and multilingual scenarios.
Our end-to-end trainable framework combines the versatility of graph neural networks by applying them over the corpus.
We evaluate our framework over a total of nine English and non-English datasets in mono-, cross-, and multilingual classification scenarios.
arXiv Detail & Related papers (2022-03-06T09:05:42Z) - XL-WiC: A Multilingual Benchmark for Evaluating Semantic
Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z) - Syntax Representation in Word Embeddings and Neural Networks -- A Survey [4.391102490444539]
This paper covers approaches to evaluating the amount of syntactic information included in word representations.
We mainly summarize research on English monolingual data on language modeling tasks.
We describe which pre-trained models and representations of language are best suited for transfer to syntactic tasks.
arXiv Detail & Related papers (2020-10-02T15:44:58Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z) - Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual
Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
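Benchmarks like Multi-SimLex are typically scored by correlating model similarity predictions with human ratings using Spearman's rank correlation. As a rough, self-contained sketch (the word pairs and scores below are invented for illustration, not taken from the dataset), the metric can be computed in plain Python:

```python
# Sketch of similarity-benchmark scoring: Spearman's rho between
# human ratings and model scores. All data below is illustrative.

def rank(values):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend over the tied block
        avg = (i + j) / 2 + 1           # average rank for the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

human = [9.1, 7.5, 3.2, 1.0]           # gold similarity ratings for 4 word pairs
model = [0.92, 0.80, 0.35, 0.20]       # hypothetical model cosine similarities
print(round(spearman(human, model), 3))  # perfectly monotone -> 1.0
```

Rank correlation is preferred over Pearson here because model similarity scores and human ratings live on different, arbitrary scales; only the ordering of pairs matters.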
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.