Does mBERT understand Romansh? Evaluating word embeddings using word alignment
- URL: http://arxiv.org/abs/2306.08702v3
- Date: Thu, 17 Aug 2023 08:54:42 GMT
- Title: Does mBERT understand Romansh? Evaluating word embeddings using word alignment
- Authors: Eyal Liron Dolev
- Abstract summary: We test similarity-based word alignment models (SimAlign and awesome-align) in combination with word embeddings from mBERT and XLM-R on parallel sentences in German and Romansh.
Using embeddings from mBERT, both models reach an alignment error rate of 0.22, which outperforms fast_align.
We additionally present a gold standard for German-Romansh word alignment.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We test similarity-based word alignment models (SimAlign and awesome-align)
in combination with word embeddings from mBERT and XLM-R on parallel sentences
in German and Romansh. Since Romansh is an unseen language, we are dealing with
a zero-shot setting. Using embeddings from mBERT, both models reach an
alignment error rate of 0.22, which outperforms fast_align, a statistical
model, and is on par with similarity-based word alignment for seen languages.
We interpret these results as evidence that mBERT contains information that can
be meaningful and applicable to Romansh.
To evaluate performance, we also present a new trilingual corpus, which we
call the DERMIT (DE-RM-IT) corpus, containing press releases issued by the Canton
of Grisons in German, Romansh and Italian over the past 25 years. The corpus
contains 4 547 parallel documents and approximately 100 000 sentence pairs in
each language combination. We additionally present a gold standard for
German-Romansh word alignment. The data is available at
https://github.com/eyldlv/DERMIT-Corpus.
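To make the zero-shot setup concrete, the following is a minimal sketch of similarity-based word alignment over mBERT embeddings, in the spirit of SimAlign's mutual-argmax method, together with the alignment error rate (AER) used for evaluation. The model name, the choice of layer 8, subword mean-pooling, and the helper names (word_embeddings, align, aer) are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of similarity-based word alignment with mBERT embeddings,
# loosely following SimAlign's mutual-argmax ("argmax") method.
# Layer choice and pooling are assumptions, not the paper's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def word_embeddings(words: list[str]) -> torch.Tensor:
    """Embed a pre-tokenized sentence; mean-pool subword vectors per word."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc, output_hidden_states=True).hidden_states[8][0]
    word_ids = enc.word_ids()  # maps each subword position to a word index
    return torch.stack([
        hidden[[j for j, w in enumerate(word_ids) if w == i]].mean(dim=0)
        for i in range(len(words))
    ])

def align(src: list[str], tgt: list[str]) -> list[tuple[int, int]]:
    """Keep (i, j) only if the two words are each other's nearest neighbour."""
    S = torch.nn.functional.normalize(word_embeddings(src), dim=-1)
    T = torch.nn.functional.normalize(word_embeddings(tgt), dim=-1)
    sim = S @ T.T                      # cosine similarity matrix
    fwd = sim.argmax(dim=1)            # best target word per source word
    bwd = sim.argmax(dim=0)            # best source word per target word
    return [(i, j.item()) for i, j in enumerate(fwd) if bwd[j].item() == i]

def aer(pred, sure, possible):
    """Alignment error rate: AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|),
    where the possible set P contains the sure set S by convention."""
    a, s = set(pred), set(sure)
    p = set(possible) | s
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Hypothetical sentence pair; gold (sure/possible) alignments would come
# from a hand-annotated standard such as the German-Romansh one above.
print(align("Der Kanton Graubünden".split(), "Il chantun Grischun".split()))
```

The mutual-argmax condition trades recall for precision: a pair survives only if each word selects the other as its nearest neighbour, which is what makes the method usable without any parallel training data. A statistical baseline such as fast_align, by contrast, must learn its lexical translation table from the parallel corpus itself.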
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - FreCDo: A Large Corpus for French Cross-Domain Dialect Identification [22.132457694021184]
We present a novel corpus for French dialect identification comprising 413,522 French text samples.
The training, validation and test splits are collected from different news websites.
This leads to a French cross-domain (FreCDo) dialect identification task.
arXiv Detail & Related papers (2022-12-15T10:32:29Z) - A New Aligned Simple German Corpus [2.7981463795578927]
We present a new sentence-aligned monolingual corpus pairing Simple German with standard German.
It contains multiple document-aligned sources which we have aligned using automatic sentence-alignment methods.
The quality of our sentence alignments, as measured by F1-score, surpasses previous work.
arXiv Detail & Related papers (2022-09-02T15:14:04Z) - SLUA: A Super Lightweight Unsupervised Word Alignment Model via
Cross-Lingual Contrastive Learning [79.91678610678885]
We propose a super lightweight unsupervised word alignment model (SLUA).
Experimental results on several public benchmarks demonstrate that our model achieves competitive, if not better, performance.
Notably, our model is a pioneering attempt to unify bilingual word embeddings and word alignments.
arXiv Detail & Related papers (2021-02-08T05:54:11Z) - Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs but fine-tuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models.
arXiv Detail & Related papers (2021-01-20T17:54:47Z) - Subword Sampling for Low Resource Word Alignment [4.663577299263155]
We propose subword sampling-based alignment of text units.
We show that the subword sampling method consistently outperforms word-level alignment on six language pairs.
arXiv Detail & Related papers (2020-12-21T19:47:04Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z) - SimAlign: High Quality Word Alignments without Parallel Training Data
using Static and Contextualized Embeddings [3.8424737607413153]
We propose word alignment methods that require no parallel data.
The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment.
We find that for two language pairs, alignments created from embeddings are superior to those produced by traditional statistical methods.
arXiv Detail & Related papers (2020-04-18T23:10:36Z) - Multilingual Alignment of Contextual Word Representations [49.42244463346612]
After alignment, multilingual BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model.
We introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer.
These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.
arXiv Detail & Related papers (2020-02-10T03:27:21Z)