BERT for Monolingual and Cross-Lingual Reverse Dictionary
- URL: http://arxiv.org/abs/2009.14790v1
- Date: Wed, 30 Sep 2020 17:00:10 GMT
- Title: BERT for Monolingual and Cross-Lingual Reverse Dictionary
- Authors: Hang Yan, Xiaonan Li, Xipeng Qiu
- Abstract summary: We propose a simple but effective method to make BERT generate the target word for this specific task.
By using multilingual BERT (mBERT), we can efficiently conduct the cross-lingual reverse dictionary task with one subword embedding.
- Score: 56.8627517256663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reverse dictionary is the task of finding the proper target word given a
description of that word. In this paper, we try to incorporate BERT into this task.
However, since BERT is based on byte-pair-encoding (BPE) subword tokenization, it is
nontrivial to make BERT generate a whole word given a description. We propose a
simple but effective method to make BERT generate the target word for this specific
task. In addition, the cross-lingual reverse dictionary is the task of finding the
proper target word given a description written in another language. Previous models
have to keep two different word embeddings and learn to align them. By using
multilingual BERT (mBERT), however, we can efficiently perform the cross-lingual
reverse dictionary task with a single subword embedding, and no alignment between
languages is necessary. More importantly, mBERT achieves remarkable cross-lingual
reverse dictionary performance even without a parallel corpus, which means it can
handle the cross-lingual task with only the corresponding monolingual data. Code is
publicly available at https://github.com/yhcc/BertForRD.git.
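As a rough, illustrative sketch of the core idea (not the authors' released implementation or training setup), the task can be framed with an off-the-shelf masked language model: append [MASK] positions after the description and score candidate words by the likelihood of their subword pieces at those positions. The checkpoint name, toy candidate list, and averaging heuristic below are assumptions for illustration only.

```python
# Illustrative sketch only -- not the BertForRD training code.
# Assumes: pip install torch transformers; checkpoint and candidates are placeholders.
import torch
from transformers import BertTokenizer, BertForMaskedLM

MODEL = "bert-base-uncased"  # swap in "bert-base-multilingual-cased" for the mBERT / cross-lingual setting
tokenizer = BertTokenizer.from_pretrained(MODEL)
model = BertForMaskedLM.from_pretrained(MODEL).eval()

def score_candidate(description: str, candidate: str) -> float:
    """Score a candidate word by the average log-likelihood of its subword
    pieces at [MASK] positions appended after the description."""
    piece_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(candidate))
    desc_ids = tokenizer.encode(description, add_special_tokens=False)
    input_ids = (
        [tokenizer.cls_token_id]
        + desc_ids
        + [tokenizer.sep_token_id]
        + [tokenizer.mask_token_id] * len(piece_ids)   # one [MASK] per subword of the candidate
        + [tokenizer.sep_token_id]
    )
    with torch.no_grad():
        logits = model(torch.tensor([input_ids])).logits[0]
    log_probs = torch.log_softmax(logits, dim=-1)
    first_mask = 1 + len(desc_ids) + 1                  # index of the first appended [MASK]
    return sum(log_probs[first_mask + i, pid].item()
               for i, pid in enumerate(piece_ids)) / len(piece_ids)

description = "a small domesticated carnivorous mammal that purrs"
candidates = ["cat", "dog", "umbrella"]                 # toy candidate list
print(max(candidates, key=lambda w: score_candidate(description, w)))
```

With the multilingual checkpoint substituted, the description and the candidate words can come from different languages while sharing one subword embedding matrix, which is the cross-lingual setting the abstract describes; the released code presumably fine-tunes BERT on reverse-dictionary data rather than relying on the raw masked-LM head as this zero-shot sketch does.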
Related papers
- L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT [0.7874708385247353]
The multilingual Sentence-BERT (SBERT) models map different languages to a common representation space.
We propose a simple yet effective approach to convert vanilla multilingual BERT models into multilingual sentence BERT models using a synthetic corpus.
We show that multilingual BERT models are inherently cross-lingual learners and that this simple baseline fine-tuning approach yields exceptional cross-lingual properties.
arXiv Detail & Related papers (2023-04-22T15:45:40Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- MarkBERT: Marking Word Boundaries Improves Chinese BERT [67.53732128091747]
MarkBERT keeps the vocabulary as Chinese characters and inserts boundary markers between contiguous words (a toy sketch of this marker insertion appears after this list).
Compared to previous word-based BERT models, MarkBERT achieves better accuracy on text classification, keyword recognition, and semantic similarity tasks.
arXiv Detail & Related papers (2022-03-12T08:43:06Z)
- Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words [50.11559460111882]
We explore the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces.
Results show that, compared to standard wordpiece-based BERT, WordBERT makes significant improvements on cloze tests and machine reading comprehension.
Since the pipeline is language-independent, we train WordBERT for Chinese and obtain significant gains on five natural language understanding datasets.
arXiv Detail & Related papers (2022-02-24T15:15:48Z)
- Lacking the embedding of a word? Look it up into a traditional dictionary [0.2624902795082451]
We propose to use definitions retrieved from traditional dictionaries to produce word embeddings for rare words.
DefiNNet and DefBERT significantly outperform state-of-the-art as well as baseline methods for producing embeddings of unknown words.
arXiv Detail & Related papers (2021-09-24T06:27:58Z)
- Subword Mapping and Anchoring across Languages [1.9352552677009318]
Subword Mapping and Anchoring across Languages (SMALA) is a method to construct bilingual subword vocabularies.
SMALA extracts subword alignments using an unsupervised state-of-the-art mapping technique.
We show that joint subword vocabularies obtained with SMALA lead to higher BLEU scores on sentences that contain many false positives and false negatives.
arXiv Detail & Related papers (2021-09-09T20:46:27Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Looking for Clues of Language in Multilingual BERT to Improve Cross-lingual Generalization [56.87201892585477]
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information.
We control the output languages of multilingual BERT by manipulating the token embeddings.
arXiv Detail & Related papers (2020-10-20T05:41:35Z)
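The MarkBERT entry above describes keeping a character-level Chinese vocabulary while inserting boundary markers between contiguous words. A minimal toy sketch of that marker-insertion step follows; the pre-segmented input and the choice of "[unused1]" as the marker token are assumptions for illustration, not details taken from the MarkBERT paper.

```python
# Toy illustration of word-boundary marker insertion for character-level Chinese input.
# MARKER and the pre-segmented sentence are assumptions, not MarkBERT's actual configuration.
MARKER = "[unused1]"  # a spare token from the BERT vocabulary repurposed as the boundary marker

def insert_boundary_markers(segmented_words):
    """Split each word into characters and interleave a boundary marker
    between adjacent words, keeping the vocabulary at the character level."""
    tokens = []
    for i, word in enumerate(segmented_words):
        tokens.extend(list(word))                 # characters of the current word
        if i < len(segmented_words) - 1:
            tokens.append(MARKER)                 # marker between contiguous words
    return tokens

# Pre-segmented sentence: "Natural language processing is interesting"
print(insert_boundary_markers(["自然", "语言", "处理", "很", "有趣"]))
# ['自', '然', '[unused1]', '语', '言', '[unused1]', '处', '理', '[unused1]', '很', '[unused1]', '有', '趣']
```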
This list is automatically generated from the titles and abstracts of the papers on this site.