xMEN: A Modular Toolkit for Cross-Lingual Medical Entity Normalization
- URL: http://arxiv.org/abs/2310.11275v1
- Date: Tue, 17 Oct 2023 13:53:57 GMT
- Title: xMEN: A Modular Toolkit for Cross-Lingual Medical Entity Normalization
- Authors: Florian Borchert, Ignacio Llorca, Roland Roller, Bert Arnrich,
Matthieu-P. Schapranow
- Abstract summary: We introduce xMEN, a modular system for cross-lingual medical entity normalization.
When synonyms in the target language are scarce for a given terminology, we leverage English aliases via cross-lingual candidate generation.
For candidate ranking, we incorporate a trainable cross-encoder model if annotations for the target task are available.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Objective: To improve the performance of medical entity
normalization across many languages, especially when fewer language resources
are available than for English.
Materials and Methods: We introduce xMEN, a modular system for cross-lingual
medical entity normalization, which performs well in both low- and
high-resource scenarios. When synonyms in the target language are scarce for a
given terminology, we leverage English aliases via cross-lingual candidate
generation. For candidate ranking, we incorporate a trainable cross-encoder
model if annotations for the target task are available. We also evaluate
cross-encoders trained in a weakly supervised manner based on
machine-translated datasets from a high-resource domain. Our system is publicly
available as an extensible Python toolkit.
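The two-stage approach described above can be sketched as follows. This is an illustrative toy pipeline, not the actual xMEN API: the terminology, concept IDs, and the string-similarity stand-in for a trained cross-encoder are all hypothetical, and a real system would jointly encode each (mention, candidate) pair with a trained model.

```python
# Hypothetical sketch of the two-stage normalization pipeline:
# (1) generate candidate concepts by matching against terminology aliases
# (including English aliases when target-language synonyms are scarce), then
# (2) re-rank the candidates with a trainable scoring model.
from difflib import SequenceMatcher

# Toy terminology: concept ID -> aliases (English and target-language)
TERMINOLOGY = {
    "C0011849": ["diabetes mellitus", "Zuckerkrankheit"],  # German alias
    "C0020538": ["hypertension", "Bluthochdruck"],
    "C0004096": ["asthma"],
}

def generate_candidates(mention: str, k: int = 2):
    """Stage 1: score every alias by string similarity, keep top-k concepts."""
    scored = []
    for cid, aliases in TERMINOLOGY.items():
        best = max(SequenceMatcher(None, mention.lower(), a.lower()).ratio()
                   for a in aliases)
        scored.append((best, cid))
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]

def cross_encoder_score(mention: str, cid: str) -> float:
    """Stage 2 stand-in: a real system would score the (mention, candidate)
    pair with a trained cross-encoder instead of string similarity."""
    return max(SequenceMatcher(None, mention.lower(), a.lower()).ratio()
               for a in TERMINOLOGY[cid])

def normalize(mention: str) -> str:
    """Full pipeline: generate candidates, then re-rank them."""
    candidates = generate_candidates(mention)
    return max(candidates, key=lambda cid: cross_encoder_score(mention, cid))

print(normalize("Zuckerkrankheit"))  # → C0011849
```

The split into a broad, recall-oriented generator and a precise, trainable re-ranker is what lets the same pipeline work in both low- and high-resource settings: only the second stage needs task-specific annotations.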
Results: xMEN improves the state-of-the-art performance across a wide range
of multilingual benchmark datasets. Weakly supervised cross-encoders are
effective when no training data is available for the target task. Through the
compatibility of xMEN with the BigBIO framework, it can be easily used with
existing and prospective datasets.
Discussion: Our experiments show the importance of balancing the output of
general-purpose candidate generators with subsequent trainable re-rankers,
which we achieve through a rank regularization term in the loss function of the
cross-encoder. However, error analysis reveals that multi-word expressions and
other complex entities are still challenging.
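The balance between generator and re-ranker mentioned above can be made concrete with a loss sketch. The regularizer below, which penalizes the re-ranker for assigning high probability to candidates the generator ranked low, is one plausible formulation for illustration only, not necessarily the exact term used in the paper.

```python
# Illustrative rank-regularized training loss for a cross-encoder re-ranker:
# cross-entropy on the gold candidate plus a penalty that grows when the
# re-ranker's probability mass lands on candidates the generator ranked low.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def rank_regularized_loss(scores, gold_index, generator_ranks, lam=0.1):
    """scores: re-ranker logits per candidate; generator_ranks[i] = 0 for the
    generator's top candidate, 1 for the next, and so on."""
    probs = softmax(scores)
    ce = -math.log(probs[gold_index])          # standard cross-entropy term
    reg = sum(p * r for p, r in zip(probs, generator_ranks))  # rank penalty
    return ce + lam * reg

loss = rank_regularized_loss(
    scores=[2.0, 0.5, -1.0],    # re-ranker logits for 3 candidates
    gold_index=0,
    generator_ranks=[0, 1, 2],  # generator agrees with the re-ranker here
)
```

With lam = 0 the re-ranker is free to overrule the generator entirely; larger values of lam pull it back toward the generator's ordering.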
Conclusion: xMEN exhibits strong performance for medical entity normalization
in multiple languages, even when no labeled data and few terminology aliases
for the target language are available. Its configuration system and evaluation
modules enable reproducible benchmarks. Models and code are available online at
the following URL: https://github.com/hpi-dhc/xmen
Related papers
- LEIA: Facilitating Cross-lingual Knowledge Transfer in Language Models with Entity-based Data Augmentation (arXiv, 2024-02-18)
We introduce LEIA, a language adaptation tuning method that utilizes Wikipedia entity names aligned across languages.
This method involves augmenting the target language corpus with English entity names and training the model using left-to-right language modeling.
- Multi-lingual Evaluation of Code Generation Models (arXiv, 2022-10-26)
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages and enable assessing the performance of code generation models in a multilingual fashion.
- Improving Multilingual Neural Machine Translation System for Indic Languages (arXiv, 2022-09-27)
We propose a multilingual neural machine translation (MNMT) system to address the issues related to low-resource language translation.
A state-of-the-art Transformer architecture is used to realize the proposed model.
Experiments on a substantial amount of data show its superiority over conventional models.
- Multilingual Autoregressive Entity Linking (arXiv, 2021-03-23)
mGENRE is a sequence-to-sequence system for the multilingual entity linking (MEL) problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token by token.
We show the efficacy of our approach through extensive evaluation, including experiments on three popular MEL benchmarks.
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation (arXiv, 2020-10-30)
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This effectively avoids the degeneration of predicting masked words conditioned only on the context in their own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization (arXiv, 2020-10-13)
Building on the Word-in-Context (WiC) dataset for assessing the ability to correctly model distinct meanings of a word, we put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding (arXiv, 2020-09-10)
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning (arXiv, 2020-05-01)
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short of translation-based transfer.