A Self-supervised Tibetan-Chinese Vocabulary Alignment Method Based on Adversarial Learning
- URL: http://arxiv.org/abs/2110.01258v1
- Date: Mon, 4 Oct 2021 08:56:33 GMT
- Title: A Self-supervised Tibetan-Chinese Vocabulary Alignment Method Based on Adversarial Learning
- Authors: Enshuai Hou and Jie Zhu
- Abstract summary: This paper uses two monolingual corpora and a small seed dictionary to learn Tibetan-Chinese vocabulary alignment, combining a semi-supervised method based on the seed dictionary with a self-supervised adversarial training method.
Results for aligning Tibetan syllables with Chinese characters are poor, which reflects the weak semantic correlation between Tibetan syllables and Chinese characters.
- Score: 3.553493344868414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tibetan is a low-resource language. To alleviate the shortage of parallel corpora between Tibetan and Chinese, this paper uses two monolingual corpora and a small seed dictionary to learn vocabulary alignment through similarity calculations over word clusters in the two embedding spaces: a semi-supervised method based on the seed dictionary and a self-supervised adversarial training method. It further puts forward an improved self-supervised adversarial learning method that aligns Tibetan and Chinese using monolingual data only. The experimental results are as follows. First, alignment between Tibetan syllables and Chinese characters performs poorly, which reflects the weak semantic correlation between Tibetan syllables and Chinese characters. Second, the semi-supervised method with the seed dictionary reaches a top-10 predicted word accuracy of 66.5 (Tibetan-Chinese) and 74.8 (Chinese-Tibetan), while the improved self-supervised method reaches an accuracy of 53.5 in both language directions.
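The pipeline described in the abstract has two ingredients: a semi-supervised mapping learned from a small seed dictionary, and a self-supervised adversarial mapping trained on monolingual embeddings only, scored by top-k word translation accuracy. Below is a minimal PyTorch sketch of this kind of setup, assuming pre-trained monolingual Tibetan and Chinese word embeddings are already available; the layer sizes, optimizers, and function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hedged sketch of seed-dictionary (Procrustes) and adversarial
# (MUSE-style) alignment of two monolingual embedding spaces.
# All sizes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

dim = 300                                        # embedding dimensionality (assumed)
W = nn.Linear(dim, dim, bias=False)              # maps Tibetan vectors into the Chinese space
discriminator = nn.Sequential(                   # outputs P(vector is a real Chinese embedding)
    nn.Linear(dim, 2048), nn.LeakyReLU(0.2),
    nn.Linear(2048, 1), nn.Sigmoid(),
)
opt_w = torch.optim.SGD(W.parameters(), lr=0.1)
opt_d = torch.optim.SGD(discriminator.parameters(), lr=0.1)
bce = nn.BCELoss()

def procrustes(seed_src, seed_tgt):
    """Semi-supervised step: closed-form orthogonal mapping from seed dictionary pairs."""
    u, _, vt = torch.linalg.svd(seed_tgt.T @ seed_src)
    with torch.no_grad():
        W.weight.copy_(u @ vt)

def adversarial_step(x_tib, y_zh):
    """Self-supervised step: discriminator vs. mapping, using monolingual batches only."""
    # 1) Train the discriminator to tell mapped Tibetan vectors from real Chinese ones.
    with torch.no_grad():
        mapped = W(x_tib)
    d_in = torch.cat([mapped, y_zh])
    d_target = torch.cat([torch.zeros(len(x_tib), 1), torch.ones(len(y_zh), 1)])
    d_loss = bce(discriminator(d_in), d_target)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the mapping W to fool the discriminator.
    g_loss = bce(discriminator(W(x_tib)), torch.ones(len(x_tib), 1))
    opt_w.zero_grad(); g_loss.backward(); opt_w.step()
    return d_loss.item(), g_loss.item()

def top_k_accuracy(x_tib, y_zh, gold, k=10):
    """Word translation accuracy@k by nearest-neighbour retrieval in the shared space."""
    sims = nn.functional.normalize(W(x_tib), dim=1) @ nn.functional.normalize(y_zh, dim=1).T
    topk = sims.topk(k, dim=1).indices
    return sum(int(gold[i] in topk[i]) for i in range(len(gold))) / len(gold)
```

In MUSE-style setups the adversarial mapping is typically refined by a Procrustes step once an induced or seed dictionary is available, and accuracy@k retrieval of the kind sketched in top_k_accuracy is the usual way figures such as the 66.5 / 74.8 / 53.5 top-10 results above are computed.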
Related papers
- TSCheater: Generating High-Quality Tibetan Adversarial Texts via Visual Similarity [3.1854179230109363]
We propose a novel Tibetan adversarial text generation method called TSCheater.
It exploits the characteristics of Tibetan encoding and the fact that visually similar syllables have similar semantics.
Experimentally, TSCheater outperforms existing methods in attack effectiveness, perturbation, semantic similarity, visual similarity, and human acceptance.
arXiv Detail & Related papers (2024-12-03T10:57:19Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- A Chinese Spelling Check Framework Based on Reverse Contrastive Learning [4.60495447017298]
We present a novel framework for Chinese spelling checking, which consists of three modules: language representation, spelling check and reverse contrastive learning.
Specifically, we propose a reverse contrastive learning strategy, which explicitly forces the model to minimize the agreement between similar examples.
Experimental results show that our framework is model-agnostic and could be combined with existing Chinese spelling check models to yield state-of-the-art performance.
arXiv Detail & Related papers (2022-10-25T08:05:38Z)
- Don't Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings [64.06041300946517]
We argue that easy-to-access cross-lingual signals should always be considered when developing unsupervised BWE methods.
We show that such cheap signals work well and that they outperform using more complex unsupervised methods on distant language pairs.
Our results show that these training signals should not be neglected when building BWEs, even for distant languages.
arXiv Detail & Related papers (2022-05-31T12:00:55Z)
- TiBERT: Tibetan Pre-trained Language Model [2.9554549423413303]
This paper collects large-scale training data from Tibetan websites and constructs a vocabulary that covers 99.95% of the words in the corpus by using SentencePiece.
We apply TiBERT to the downstream tasks of text classification and question generation, and compare it with classic models and multilingual pre-trained models.
arXiv Detail & Related papers (2022-05-15T14:45:08Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- A Supervised Word Alignment Method based on Cross-Language Span Prediction using Multilingual BERT [22.701728185474195]
We first formalize a word alignment problem as a collection of independent predictions from a token in the source sentence to a span in the target sentence.
We then solve this problem by using multilingual BERT, which is fine-tuned on a manually created gold word alignment data.
We show that the proposed method significantly outperformed previous supervised and unsupervised word alignment methods without using any bitexts for pretraining; a hedged sketch of this span-prediction formulation is given after this list.
arXiv Detail & Related papers (2020-04-29T23:40:08Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, UNMT can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
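As referenced in the cross-language span prediction entry above, the following is a hedged sketch of how word alignment can be cast as SQuAD-style span prediction with multilingual BERT: the query word is marked in the source sentence, the source-target pair is encoded jointly, and a start/end head picks the aligned target span. The model name, marker symbol, and head are assumptions for illustration, not that paper's exact implementation.

```python
# Hedged sketch of word alignment as cross-language span prediction with
# multilingual BERT; marker choice and head are illustrative assumptions.
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
encoder = BertModel.from_pretrained("bert-base-multilingual-cased")
span_head = nn.Linear(encoder.config.hidden_size, 2)      # start / end logits per token

def predict_target_span(src_words, tgt_sentence, src_index):
    """Predict the span of tgt_sentence aligned to src_words[src_index]."""
    # Mark the query word in the source sentence (SQuAD-style "question").
    marked = src_words[:src_index] + ["¶", src_words[src_index], "¶"] + src_words[src_index + 1:]
    enc = tokenizer(" ".join(marked), tgt_sentence, return_tensors="pt", truncation=True)
    hidden = encoder(**enc).last_hidden_state              # (1, seq_len, hidden_size)
    start_logits, end_logits = span_head(hidden).split(1, dim=-1)
    start = start_logits.squeeze(-1).argmax(dim=-1).item()
    end = end_logits.squeeze(-1).argmax(dim=-1).item()
    return start, end                                      # token positions in the pair encoding
```

Fine-tuning on manually created gold word alignments, as that summary describes, would train both the encoder and the span head; the snippet only shows the inference shape of the formulation.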
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences.