A Simple and Effective Method To Eliminate the Self Language Bias in
Multilingual Representations
- URL: http://arxiv.org/abs/2109.04727v1
- Date: Fri, 10 Sep 2021 08:15:37 GMT
- Title: A Simple and Effective Method To Eliminate the Self Language Bias in
Multilingual Representations
- Authors: Ziyi Yang, Yinfei Yang, Daniel Cer and Eric Darve
- Abstract summary: Language agnostic and semantic-language information isolation is an emerging research direction for multilingual representations models.
"Language Information Removal (LIR)" factors out language identity information from semantic related components in multilingual representations pre-trained on multi-monolingual data.
LIR reveals that for weak-alignment multilingual systems, the principal components of semantic spaces primarily encodes language identity information.
- Score: 7.571549274473274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language agnostic and semantic-language information isolation is an emerging
research direction for multilingual representations models. We explore this
problem from a novel angle of geometric algebra and semantic space. A simple
but highly effective method "Language Information Removal (LIR)" factors out
language identity information from semantic related components in multilingual
representations pre-trained on multi-monolingual data. A post-training and
model-agnostic method, LIR only uses simple linear operations, e.g. matrix
factorization and orthogonal projection. LIR reveals that for weak-alignment
multilingual systems, the principal components of semantic spaces primarily
encodes language identity information. We first evaluate the LIR on a
cross-lingual question answer retrieval task (LAReQA), which requires the
strong alignment for the multilingual embedding space. Experiment shows that
LIR is highly effectively on this task, yielding almost 100% relative
improvement in MAP for weak-alignment models. We then evaluate the LIR on
Amazon Reviews and XEVAL dataset, with the observation that removing language
information is able to improve the cross-lingual transfer performance.
Related papers
- Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
Lens is a novel approach to enhance multilingual capabilities of large language models (LLMs)
It operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs.
It achieves superior results with much fewer computational resources compared to existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z) - Language Representations Can be What Recommenders Need: Findings and Potentials [57.90679739598295]
We show that item representations, when linearly mapped from advanced LM representations, yield superior recommendation performance.
This outcome suggests the possible homomorphism between the advanced language representation space and an effective item representation space for recommendation.
Our findings highlight the connection between language modeling and behavior modeling, which can inspire both natural language processing and recommender system communities.
arXiv Detail & Related papers (2024-07-07T17:05:24Z) - Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z) - Discovering Low-rank Subspaces for Language-agnostic Multilingual
Representations [38.56175462620892]
Large pretrained multilingual language models (ML-LMs) have shown remarkable capabilities of zero-shot cross-lingual transfer.
We present a novel view of projecting away language-specific factors from a multilingual embedding space.
We show that applying our method consistently leads to improvements over commonly used ML-LMs.
arXiv Detail & Related papers (2024-01-11T09:54:11Z) - GradSim: Gradient-Based Language Grouping for Effective Multilingual
Training [13.730907708289331]
We propose GradSim, a language grouping method based on gradient similarity.
Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains.
Besides linguistic features, the topics of the datasets play an important role for language grouping.
arXiv Detail & Related papers (2023-10-23T18:13:37Z) - Multilingual Entity and Relation Extraction from Unified to
Language-specific Training [29.778332361215636]
Existing approaches for entity and relation extraction tasks mainly focus on the English corpora and ignore other languages.
We propose a two-stage multilingual training method and a joint model called Multilingual Entity and Relation Extraction framework (mERE) to mitigate language interference.
Our method outperforms both the monolingual and multilingual baseline methods.
arXiv Detail & Related papers (2023-01-11T12:26:53Z) - Language Agnostic Multilingual Information Retrieval with Contrastive
Learning [59.26316111760971]
We present an effective method to train multilingual information retrieval systems.
We leverage parallel and non-parallel corpora to improve the pretrained multilingual language models.
Our model can work well even with a small number of parallel sentences.
arXiv Detail & Related papers (2022-10-12T23:53:50Z) - Multilingual Transfer Learning for QA Using Translation as Data
Augmentation [13.434957024596898]
We explore strategies that improve cross-lingual transfer by bringing the multilingual embeddings closer in the semantic space.
We propose two novel strategies, language adversarial training and language arbitration framework, which significantly improve the (zero-resource) cross-lingual transfer performance.
Empirically, we show that the proposed models outperform the previous zero-shot baseline on the recently introduced multilingual MLQA and TyDiQA datasets.
arXiv Detail & Related papers (2020-12-10T20:29:34Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.