Impact of Subword Pooling Strategy on Cross-lingual Event Detection
- URL: http://arxiv.org/abs/2302.11365v2
- Date: Thu, 23 Feb 2023 02:04:27 GMT
- Title: Impact of Subword Pooling Strategy on Cross-lingual Event Detection
- Authors: Shantanu Agarwal, Steven Fincke, Chris Jenkins, Scott Miller,
Elizabeth Boschee
- Abstract summary: A pooling strategy takes the subword representations as input and outputs a representation for the entire word.
We show that the choice of pooling strategy can have a significant impact on the target language performance.
We carry out our analysis with five different pooling strategies across nine languages in diverse multi-lingual datasets.
- Score: 2.3361634876233817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained multilingual language models (e.g., mBERT, XLM-RoBERTa) have
significantly advanced the state-of-the-art for zero-shot cross-lingual
information extraction. These language models ubiquitously rely on word
segmentation techniques that break a word into smaller constituent subwords.
Therefore, all word labeling tasks (e.g., named entity recognition, event
detection) necessitate a pooling strategy that takes the subword
representations as input and outputs a representation for the entire word.
Taking the task of cross-lingual event detection as a motivating example, we
show that the choice of pooling strategy can have a significant impact on the
target language performance. For example, the performance varies by up to 16
absolute $f_{1}$ points depending on the pooling strategy when training in
English and testing in Arabic on the ACE task. We carry out our analysis with
five different pooling strategies across nine languages in diverse
multi-lingual datasets. Across configurations, we find that the canonical
strategy of taking just the first subword to represent the entire word is
usually sub-optimal. On the other hand, we show that attention pooling is
robust to language and dataset variations by being either the best or close to
the optimal strategy. For reproducibility, we make our code available at
https://github.com/isi-boston/ed-pooling.
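For intuition, here is a minimal sketch of two of the pooling strategies discussed above: the canonical first-subword choice and learned attention pooling. It is an illustration only, not the released ed-pooling code, and assumes PyTorch with a tensor holding one word's subword representations of shape (num_subwords, hidden).

```python
import torch
import torch.nn as nn


def first_subword_pool(subword_states: torch.Tensor) -> torch.Tensor:
    """Canonical strategy: represent the word by its first subword only."""
    return subword_states[0]


class AttentionPool(nn.Module):
    """Learned pooling: score each subword, then take the softmax-weighted sum."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, subword_states: torch.Tensor) -> torch.Tensor:
        # subword_states: (num_subwords, hidden) for a single word
        weights = torch.softmax(self.score(subword_states), dim=0)  # (num_subwords, 1)
        return (weights * subword_states).sum(dim=0)                # (hidden,)


if __name__ == "__main__":
    hidden = 768                      # e.g., mBERT / XLM-RoBERTa base hidden size
    states = torch.randn(4, hidden)   # a word split into 4 subwords
    print(first_subword_pool(states).shape)     # torch.Size([768])
    print(AttentionPool(hidden)(states).shape)  # torch.Size([768])
```

In a word-labeling model, the pooled vector feeds the per-word classifier, and the attention scorer is trained jointly with the rest of the network.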
Related papers
- How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text? [38.1823640848362]
Large language models (LLMs) have shown remarkable capabilities in many languages beyond English.
LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary.
Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue.
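A rough sketch of what this vocabulary expansion step can look like with the Hugging Face transformers API; the base model and the added tokens below are placeholders, not taken from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base LLM; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical target-language tokens mined from a small target-language corpus.
new_tokens = ["habari", "dunia"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix to match the expanded vocabulary; the new rows are
# newly initialized and are typically trained further on the target-language text.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, new vocab size {len(tokenizer)}")
```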
arXiv Detail & Related papers (2024-06-17T12:42:34Z)
- Universal Cross-Lingual Text Classification
This research proposes a novel perspective on Universal Cross-Lingual Text Classification.
Our approach involves blending supervised data from different languages during training to create a universal model.
The primary goal is to enhance label and language coverage, aiming for a label set that represents a union of labels from various languages.
arXiv Detail & Related papers (2024-06-16T17:58:29Z)
- DeMuX: Data-efficient Multilingual Learning [57.37123046817781]
DEMUX is a framework that prescribes exact data-points to label from vast amounts of unlabelled multilingual data.
Our end-to-end framework is language-agnostic, accounts for model representations, and supports multilingual target configurations.
arXiv Detail & Related papers (2023-11-10T20:09:08Z)
- Efficient Spoken Language Recognition via Multilabel Classification [53.662747523872305]
We show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods.
Our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
arXiv Detail & Related papers (2023-06-02T23:04:19Z)
- Meta-Learning a Cross-lingual Manifold for Semantic Parsing [75.26271012018861]
Localizing a semantic parser to support new languages requires effective cross-lingual generalization.
We introduce a first-order meta-learning algorithm to train a semantic parser with maximal sample efficiency during cross-lingual transfer.
Results across six languages on ATIS demonstrate that our combination of steps yields accurate semantic parsers while sampling $\le$10% of source training data in each new language.
arXiv Detail & Related papers (2022-09-26T10:42:17Z)
- Everything Is All It Takes: A Multipronged Strategy for Zero-Shot Cross-Lingual Information Extraction [42.138153925505435]
We show that a combination of approaches, both new and old, leads to better performance than any one cross-lingual strategy in particular.
We use English-to-Arabic IE as our initial example, demonstrating strong performance in this setting for event extraction, named entity recognition, part-of-speech tagging, and dependency parsing.
Because no single set of techniques performs the best across all tasks, we encourage practitioners to explore various configurations of the techniques described in this work when seeking to improve on zero-shot training.
arXiv Detail & Related papers (2021-09-14T16:21:14Z)
- Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
- Subword Pooling Makes a Difference [0.0]
We investigate how the choice of subword pooling affects the downstream performance on three tasks.
For morphological tasks, the widely used 'choose the first subword' is the worst strategy.
For POS tagging this strategy also performs poorly, and the best choice is to use a small LSTM over the subwords.
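A minimal sketch of the 'small LSTM over the subwords' pooling variant mentioned above; the LSTM size and the use of its final hidden state are illustrative assumptions rather than that paper's exact setup.

```python
import torch
import torch.nn as nn


class LSTMPool(nn.Module):
    """Run a small LSTM over a word's subword vectors and keep its final state."""

    def __init__(self, hidden_size: int, lstm_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(hidden_size, lstm_size, batch_first=True)
        self.proj = nn.Linear(lstm_size, hidden_size)

    def forward(self, subword_states: torch.Tensor) -> torch.Tensor:
        # subword_states: (num_subwords, hidden) for a single word
        _, (h_n, _) = self.lstm(subword_states.unsqueeze(0))  # h_n: (1, 1, lstm_size)
        return self.proj(h_n[-1, 0])                          # (hidden,)
```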
arXiv Detail & Related papers (2021-02-22T09:59:30Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
The Word-in-Context dataset (WiC) assesses the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We additionally propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
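A hedged sketch of such a KL-divergence self-teaching term; the tensor shapes, temperature, and direction of the KL are illustrative assumptions, not the FILTER implementation.

```python
import torch
import torch.nn.functional as F


def self_teaching_kl(student_logits: torch.Tensor,
                     pseudo_label_logits: torch.Tensor,
                     temperature: float = 1.0) -> torch.Tensor:
    """KL(soft pseudo-labels || student predictions), averaged over the batch."""
    with torch.no_grad():  # pseudo-labels act as a fixed teacher signal
        soft_labels = F.softmax(pseudo_label_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_labels, reduction="batchmean")


# usage on dummy logits for a batch of 4 examples and 3 classes
loss = self_teaching_kl(torch.randn(4, 3), torch.randn(4, 3))
```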
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- LAReQA: Language-agnostic answer retrieval from a multilingual pool [29.553907688813347]
LAReQA tests for "strong" cross-lingual alignment.
We find that augmenting training data via machine translation is effective.
This finding underscores our claim that language-agnostic retrieval is a substantively new kind of cross-lingual evaluation.
arXiv Detail & Related papers (2020-04-11T20:51:11Z)