Enhancing Cross-lingual Transfer via Phonemic Transcription Integration
- URL: http://arxiv.org/abs/2307.04361v1
- Date: Mon, 10 Jul 2023 06:17:33 GMT
- Title: Enhancing Cross-lingual Transfer via Phonemic Transcription Integration
- Authors: Hoang H. Nguyen, Chenwei Zhang, Tao Zhang, Eugene Rohrbaugh, Philip S. Yu
- Abstract summary: PhoneXL is a framework incorporating phonemic transcriptions as an additional linguistic modality for cross-lingual transfer.
Our pilot study reveals that phonemic transcription provides essential information beyond the orthography to enhance cross-lingual transfer.
- Score: 57.109031654219294
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous cross-lingual transfer methods are restricted to orthographic
representation learning via textual scripts. This limitation hampers
cross-lingual transfer and is biased towards languages sharing similar
well-known scripts. To alleviate the gap between languages from different
writing scripts, we propose PhoneXL, a framework incorporating phonemic
transcriptions as an additional linguistic modality beyond the traditional
orthographic transcriptions for cross-lingual transfer. In particular, we
propose unsupervised alignment objectives to capture (1) local one-to-one
alignment between the two different modalities, (2) alignment via
multi-modality contexts to leverage information from additional modalities, and
(3) alignment via multilingual contexts where additional bilingual dictionaries
are incorporated. We also release the first phonemic-orthographic alignment
dataset on two token-level tasks (Named Entity Recognition and Part-of-Speech
Tagging) among the understudied but interconnected
Chinese-Japanese-Korean-Vietnamese (CJKV) languages. Our pilot study reveals that
phonemic transcription provides essential information beyond the orthography to
enhance cross-lingual transfer and bridge the gap among CJKV languages, leading
to consistent improvements on cross-lingual token-level tasks over
orthographic-based multilingual PLMs.
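To make the first alignment objective more concrete, here is a minimal sketch of a local one-to-one alignment loss between orthographic and phonemic token embeddings. It is an illustration only, not the PhoneXL implementation: the encoders, the pairing of tokens, and the symmetric InfoNCE-style contrastive form are assumptions.

```python
# Minimal sketch of a local one-to-one orthographic-phonemic alignment
# objective (contrastive/InfoNCE style). Illustration only: the encoders,
# pairing, and loss form are assumptions, not the PhoneXL code.
import torch
import torch.nn.functional as F

def local_alignment_loss(ortho_emb: torch.Tensor,
                         phone_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """ortho_emb, phone_emb: (N, d) embeddings of N aligned token pairs,
    where row i of each tensor corresponds to the same underlying token."""
    ortho = F.normalize(ortho_emb, dim=-1)
    phone = F.normalize(phone_emb, dim=-1)
    logits = ortho @ phone.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(ortho.size(0), device=logits.device)
    # Symmetric cross-entropy: each orthographic token should retrieve its
    # phonemic counterpart, and vice versa.
    loss_o2p = F.cross_entropy(logits, targets)
    loss_p2o = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_o2p + loss_p2o)

# Example with random stand-in embeddings for 8 aligned token pairs.
if __name__ == "__main__":
    o = torch.randn(8, 768)   # e.g., multilingual PLM token embeddings (assumed)
    p = torch.randn(8, 768)   # e.g., phoneme-encoder token embeddings (assumed)
    print(local_alignment_loss(o, p).item())
```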
Related papers
- CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts [50.44270798959864]
Some languages are better connected than others, and target languages can benefit from transfer from closely related languages.
We study the impact of the source language on cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language.
arXiv Detail & Related papers (2024-04-19T04:02:50Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Multilingual Pre-training with Language and Task Adaptation for Multilingual Text Style Transfer [14.799109368073548]
We exploit the pre-trained seq2seq model mBART for multilingual text style transfer.
Using machine-translated data as well as gold-aligned English sentences yields state-of-the-art results.
arXiv Detail & Related papers (2022-03-16T11:27:48Z)
- Syntax-augmented Multilingual BERT for Cross-lingual Transfer [37.99210035238424]
This work shows that explicitly providing language syntax when training mBERT helps cross-lingual transfer.
Experimental results show that syntax-augmented mBERT improves cross-lingual transfer on popular benchmarks.
arXiv Detail & Related papers (2021-06-03T21:12:50Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages (see the cross-attention sketch after this list).
It effectively avoids the degeneration of predicting masked words conditioned only on the context of their own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language (see the loss sketch after this list).
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
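To make the cross-attention idea in the VECO entry concrete, here is a minimal sketch of a plug-in layer that lets token representations in one language attend to a parallel sentence in another language. The dimensions, the use of nn.MultiheadAttention, and the residual placement are assumptions for illustration, not VECO's released code.

```python
# Minimal sketch of a plug-in cross-attention module between two language
# streams, in the spirit of the VECO entry above. Shapes and the
# residual + LayerNorm placement are illustrative assumptions.
import torch
import torch.nn as nn

class CrossLingualAttention(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, src_states: torch.Tensor, tgt_states: torch.Tensor) -> torch.Tensor:
        """src_states: (B, S, d) hidden states of a sentence in language A.
        tgt_states: (B, T, d) hidden states of its parallel sentence in language B.
        Each position in language A attends to the parallel sentence in language B,
        so masked-word prediction is not conditioned only on its own language."""
        cross, _ = self.attn(query=src_states, key=tgt_states, value=tgt_states)
        return self.norm(src_states + cross)   # residual connection + LayerNorm

# Example with random stand-in encoder states.
if __name__ == "__main__":
    layer = CrossLingualAttention()
    a = torch.randn(2, 10, 768)   # language-A encoder states (assumed)
    b = torch.randn(2, 12, 768)   # parallel language-B encoder states (assumed)
    print(layer(a, b).shape)      # torch.Size([2, 10, 768])
```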
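The KL-divergence self-teaching loss mentioned in the FILTER entry can likewise be illustrated with a small sketch: soft pseudo-labels from a teacher pass over the translated target-language text supervise the student's predictions via KL divergence. The teacher/student naming, the temperature, and the shapes are assumptions for illustration, not FILTER's implementation.

```python
# Minimal sketch of a KL-divergence self-teaching loss on auto-generated soft
# pseudo-labels, as described in the FILTER entry above. Names, shapes, and the
# temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def self_teaching_kl_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 2.0) -> torch.Tensor:
    """student_logits, teacher_logits: (N, C) per-token logits over C classes,
    e.g., NER tags for N tokens of translated target-language text."""
    # The teacher produces soft pseudo-labels; it is not updated by this loss.
    teacher_probs = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over tokens.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Example with random stand-in logits for 16 tokens and 9 NER tags.
if __name__ == "__main__":
    s = torch.randn(16, 9)
    t = torch.randn(16, 9)
    print(self_teaching_kl_loss(s, t).item())
```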
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.