EntityCS: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching
- URL: http://arxiv.org/abs/2210.12540v1
- Date: Sat, 22 Oct 2022 20:05:40 GMT
- Title: EntityCS: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching
- Authors: Chenxi Whitehouse, Fenia Christopoulou, Ignacio Iacobacci
- Abstract summary: Code-Switching offers language alignment at word- or phrase-level.
Existing approaches either use dictionaries or parallel sentences with word-alignment to generate CS data.
We propose EntityCS to capture fine-grained cross-lingual semantics without corrupting syntax.
- Score: 15.884119564193924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate alignment between languages is fundamental for improving
cross-lingual pre-trained language models (XLMs). Motivated by the natural
phenomenon of code-switching (CS) in multilingual speakers, prior work has used CS as
an effective data augmentation method that offers language alignment at the word
or phrase level, in contrast to the sentence-level alignment of parallel instances. Existing
approaches either use dictionaries or parallel sentences with word-alignment to
generate CS data by randomly switching words in a sentence. However, such
methods can be suboptimal as dictionaries disregard semantics, and syntax might
become invalid after random word switching. In this work, we propose EntityCS,
a method that focuses on Entity-level Code-Switching to capture fine-grained
cross-lingual semantics without corrupting syntax. We use Wikidata and the
English Wikipedia to construct an entity-centric CS corpus by switching
entities to their counterparts in other languages. We further propose
entity-oriented masking strategies during intermediate model training on the
EntityCS corpus for improving entity prediction. Evaluation of the trained
models on four entity-centric downstream tasks shows consistent improvements
over the baseline with a notable increase of 10% in Fact Retrieval. We release
the corpus and models to assist research on code-switching and enriching XLMs
with external knowledge.
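Since the abstract describes corpus construction only at a high level, below is a minimal sketch of entity-level code-switching under some assumptions: entity mentions have already been linked to Wikidata IDs, and multilingual labels for those IDs have been collected (e.g. from a Wikidata dump). The `wikidata_labels` mapping and the helper function are illustrative stand-ins, not the released implementation.

```python
import random

# Hypothetical mapping: Wikidata ID -> {language code: entity label}.
# In practice these labels would come from a Wikidata dump or the Wikidata API.
wikidata_labels = {
    "Q90": {"de": "Paris", "el": "Παρίσι", "tr": "Paris"},
    "Q142": {"de": "Frankreich", "el": "Γαλλία", "tr": "Fransa"},
}

def entity_code_switch(tokens, entity_spans, target_langs, p_switch=0.8):
    """Switch entity mentions to their labels in randomly chosen target languages.

    tokens:       tokens of an English Wikipedia sentence
    entity_spans: list of (start, end, wikidata_id) spans over `tokens`
    target_langs: candidate languages to switch into
    """
    out, cursor = [], 0
    for start, end, qid in sorted(entity_spans):
        out.extend(tokens[cursor:start])
        lang = random.choice(target_langs)
        label = wikidata_labels.get(qid, {}).get(lang)
        # Switch only the entity span, and only when a label exists, so the
        # surrounding English syntax stays intact.
        if label is not None and random.random() < p_switch:
            out.extend(label.split())
        else:
            out.extend(tokens[start:end])
        cursor = end
    out.extend(tokens[cursor:])
    return out

sentence = "Paris is the capital of France .".split()
spans = [(0, 1, "Q90"), (5, 6, "Q142")]
print(entity_code_switch(sentence, spans, ["de", "el", "tr"]))
# e.g. ['Παρίσι', 'is', 'the', 'capital', 'of', 'Fransa', '.']
```

For the entity-oriented masking strategies, one plausible variant is to mask whole (switched) entity spans rather than random subwords during masked language modelling, so the model must predict an entity from its cross-lingual context; the released corpus and models should be consulted for the exact strategies used.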
Related papers
- Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training [58.696660064190475]
We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities.
To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching.
arXiv Detail & Related papers (2025-04-02T15:09:58Z)
- Cross-lingual Back-Parsing: Utterance Synthesis from Meaning Representation for Zero-Resource Semantic Parsing [6.074150063191985]
Cross-Lingual Back-Parsing is a novel data augmentation methodology designed to enhance cross-lingual transfer for semantic parsing.
Our methodology effectively performs cross-lingual data augmentation in challenging zero-resource settings.
arXiv Detail & Related papers (2024-10-01T08:53:38Z)
- mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER).
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
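The summary above does not spell out the contrastive objective, so the following is only a generic sketch of one plausible ingredient: an InfoNCE-style loss that pulls representations of aligned source/target pairs together. The actual mCL-NER objective is multi-view and differs in its details.

```python
import torch
import torch.nn.functional as F

def cross_lingual_contrastive_loss(src_repr, tgt_repr, temperature=0.1):
    """InfoNCE-style loss: the i-th source example should be most similar to the
    i-th target-language example among all targets in the batch.

    src_repr, tgt_repr: (batch, dim) representations of aligned pairs.
    """
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    logits = src @ tgt.t() / temperature                  # (batch, batch) similarities
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric loss: match source-to-target and target-to-source.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```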
arXiv Detail & Related papers (2023-08-17T16:02:29Z)
- Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages [18.210880703295253]
We finetune pretrained language models (PLMs) on seven languages from three different families.
We analyze their zero-shot performance on closely related, non-standardized varieties.
Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data is the strongest predictor for model performance on target data.
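The predictor named above, the share of words split into multiple subwords, is straightforward to compute. The sketch below assumes a HuggingFace-style tokenizer exposing a `tokenize` method and is not taken from the paper's code.

```python
def subword_split_rate(sentences, tokenize):
    """Fraction of whitespace-separated words that a tokenizer splits into >1 subword."""
    total, split = 0, 0
    for sentence in sentences:
        for word in sentence.split():
            total += 1
            if len(tokenize(word)) > 1:
                split += 1
    return split / max(total, 1)

# Assumed usage: compare source and target data under the same tokenizer.
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
# gap = abs(subword_split_rate(source_sents, tok.tokenize)
#           - subword_split_rate(target_sents, tok.tokenize))
```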
arXiv Detail & Related papers (2023-04-20T08:32:34Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
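As a rough illustration of labeled sequence translation, the sketch below wraps entity spans in bracket markers before translation and reads them back afterwards. CROP itself trains a dedicated multilingual labeled sequence translation model, so this shows only the general projection idea, and the `translate` function is hypothetical.

```python
import re

def insert_markers(tokens, entities):
    """Wrap each labeled span in bracket markers, e.g. '[PER Barack Obama ]'."""
    out, cursor = [], 0
    for start, end, label in sorted(entities):
        out.extend(tokens[cursor:start])
        out.append(f"[{label}")
        out.extend(tokens[start:end])
        out.append("]")
        cursor = end
    out.extend(tokens[cursor:])
    return " ".join(out)

def extract_markers(labeled_sentence):
    """Recover (label, surface text) pairs from a marked-up sentence."""
    return re.findall(r"\[(\w+) (.*?) \]", labeled_sentence)

marked = insert_markers("Barack Obama visited Berlin .".split(),
                        [(0, 2, "PER"), (3, 4, "LOC")])
print(marked)                   # [PER Barack Obama ] visited [LOC Berlin ] .
print(extract_markers(marked))  # [('PER', 'Barack Obama'), ('LOC', 'Berlin')]
# labeled_target = translate(marked, "de")   # hypothetical labeled sequence translation
# target_entities = extract_markers(labeled_target)
```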
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- End-to-End Speech Translation for Code Switched Speech [13.97982457879585]
Code switching (CS) refers to the phenomenon of interchangeably using words and phrases from different languages.
We focus on CS in the context of English/Spanish conversations for the task of speech translation (ST), generating and evaluating both transcript and translation.
We show that our ST architectures, and especially our bidirectional end-to-end architecture, perform well on CS speech, even when no CS training data is used.
arXiv Detail & Related papers (2022-04-11T13:25:30Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This helps avoid the degeneration of predicting masked words conditioned only on the context in their own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
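As a toy illustration (not VECO's actual architecture), the block below shows what plugging a cross-attention module between two language streams can look like in PyTorch: queries come from one language while keys and values come from its parallel sentence.

```python
import torch
from torch import nn

class CrossAttentionBlock(nn.Module):
    """Minimal cross-attention layer: states of language x attend to language y."""

    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_states, y_states):
        # Queries from language x; keys/values from the parallel sentence in
        # language y, so predicting masked words in x can condition on y.
        attended, _ = self.attn(x_states, y_states, y_states)
        return self.norm(x_states + attended)

x = torch.randn(2, 16, 768)   # hidden states of sentences in language x
y = torch.randn(2, 20, 768)   # hidden states of their translations in language y
print(CrossAttentionBlock()(x, y).shape)   # torch.Size([2, 16, 768])
```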
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for the translated text in the target language.
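A generic PyTorch version of such a self-teaching loss is sketched below; it simplifies away FILTER's task-specific details (for sequence labeling the distributions would be per token) and the tensors here are placeholders.

```python
import torch
import torch.nn.functional as F

def self_teaching_kl_loss(student_logits, soft_pseudo_labels, temperature=1.0):
    """KL divergence between predictions on the target-language input and
    soft pseudo-labels auto-generated for its translation."""
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_pseudo_labels, reduction="batchmean")

# Placeholder tensors standing in for real model outputs.
logits = torch.randn(4, 3)                      # target-language predictions
pseudo = F.softmax(torch.randn(4, 3), dim=-1)   # soft pseudo-label distribution
print(self_teaching_kl_loss(logits, pseudo))
```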
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models fitted to the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.