BERT for Monolingual and Cross-Lingual Reverse Dictionary
- URL: http://arxiv.org/abs/2009.14790v1
- Date: Wed, 30 Sep 2020 17:00:10 GMT
- Title: BERT for Monolingual and Cross-Lingual Reverse Dictionary
- Authors: Hang Yan, Xiaonan Li, Xipeng Qiu
- Abstract summary: We propose a simple but effective method to make BERT generate the target word for this specific task.
By using multilingual BERT (mBERT), we can efficiently conduct the cross-lingual reverse dictionary task with one subword embedding.
- Score: 56.8627517256663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reverse dictionary is the task of finding the proper target word given a
description of that word. In this paper, we try to incorporate BERT into this task.
However, since BERT is based on byte-pair-encoding (BPE) subword tokenization, it is
nontrivial to make BERT generate a whole word given a description. We propose a
simple but effective method to make BERT generate the target word for this specific
task. In addition, the cross-lingual reverse dictionary is the task of finding the
proper target word given a description written in another language. Previous models
have to keep two different word embeddings and learn to align them. By using
multilingual BERT (mBERT), however, we can efficiently perform the cross-lingual
reverse dictionary task with a single subword embedding, and no alignment between
languages is necessary. More importantly, mBERT achieves remarkable cross-lingual
reverse dictionary performance even without a parallel corpus, which means it can
handle the cross-lingual task with only the corresponding monolingual data. Code is
publicly available at https://github.com/yhcc/BertForRD.git.
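As a rough, illustrative sketch of the core idea (not the authors' released implementation or training setup), the task can be framed with an off-the-shelf masked language model: append [MASK] positions after the description and score candidate words by the likelihood of their subword pieces at those positions. The checkpoint name, toy candidate list, and averaging heuristic below are assumptions for illustration only.

```python
# Illustrative sketch only -- not the BertForRD training code.
# Assumes: pip install torch transformers; checkpoint and candidates are placeholders.
import torch
from transformers import BertTokenizer, BertForMaskedLM

MODEL = "bert-base-uncased"  # swap in "bert-base-multilingual-cased" for the mBERT / cross-lingual setting
tokenizer = BertTokenizer.from_pretrained(MODEL)
model = BertForMaskedLM.from_pretrained(MODEL).eval()

def score_candidate(description: str, candidate: str) -> float:
    """Score a candidate word by the average log-likelihood of its subword
    pieces at [MASK] positions appended after the description."""
    piece_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(candidate))
    desc_ids = tokenizer.encode(description, add_special_tokens=False)
    input_ids = (
        [tokenizer.cls_token_id]
        + desc_ids
        + [tokenizer.sep_token_id]
        + [tokenizer.mask_token_id] * len(piece_ids)   # one [MASK] per subword of the candidate
        + [tokenizer.sep_token_id]
    )
    with torch.no_grad():
        logits = model(torch.tensor([input_ids])).logits[0]
    log_probs = torch.log_softmax(logits, dim=-1)
    first_mask = 1 + len(desc_ids) + 1                  # index of the first appended [MASK]
    return sum(log_probs[first_mask + i, pid].item()
               for i, pid in enumerate(piece_ids)) / len(piece_ids)

description = "a small domesticated carnivorous mammal that purrs"
candidates = ["cat", "dog", "umbrella"]                 # toy candidate list
print(max(candidates, key=lambda w: score_candidate(description, w)))
```

With the multilingual checkpoint substituted, the description and the candidate words can come from different languages while sharing one subword embedding matrix, which is the cross-lingual setting the abstract describes; the released code presumably fine-tunes BERT on reverse-dictionary data rather than relying on the raw masked-LM head as this zero-shot sketch does.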
Related papers
- L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT [0.7874708385247353]
The multilingual Sentence-BERT (SBERT) models map different languages to a common representation space.
We propose a simple yet effective approach to convert vanilla multilingual BERT models into multilingual sentence BERT models using a synthetic corpus.
We show that multilingual BERT models are inherently cross-lingual learners and that this simple baseline fine-tuning approach yields exceptional cross-lingual properties.
arXiv Detail & Related papers (2023-04-22T15:45:40Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- MarkBERT: Marking Word Boundaries Improves Chinese BERT [67.53732128091747]
MarkBERT keeps the vocabulary as Chinese characters and inserts boundary markers between contiguous words (a toy sketch of this marker insertion appears after this list).
Compared to previous word-based BERT models, MarkBERT achieves better accuracy on text classification, keyword recognition, and semantic similarity tasks.
arXiv Detail & Related papers (2022-03-12T08:43:06Z)
- Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words [50.11559460111882]
We explore the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces.
Results show that, compared to standard wordpiece-based BERT, WordBERT makes significant improvements on cloze tests and machine reading comprehension.
Since the pipeline is language-independent, we train WordBERT for Chinese and obtain significant gains on five natural language understanding datasets.
arXiv Detail & Related papers (2022-02-24T15:15:48Z)
- Lacking the embedding of a word? Look it up into a traditional dictionary [0.2624902795082451]
We propose to use definitions retrieved from traditional dictionaries to produce word embeddings for rare words.
DefiNNet and DefBERT significantly outperform state-of-the-art as well as baseline methods for producing embeddings of unknown words.
arXiv Detail & Related papers (2021-09-24T06:27:58Z)
- Subword Mapping and Anchoring across Languages [1.9352552677009318]
Subword Mapping and Anchoring across Languages (SMALA) is a method to construct bilingual subword vocabularies.
SMALA extracts subword alignments using an unsupervised state-of-the-art mapping technique.
We show that joint subword vocabularies obtained with SMALA lead to higher BLEU scores on sentences that contain many false positives and false negatives.
arXiv Detail & Related papers (2021-09-09T20:46:27Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Looking for Clues of Language in Multilingual BERT to Improve Cross-lingual Generalization [56.87201892585477]
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information.
We control the output languages of multilingual BERT by manipulating the token embeddings.
arXiv Detail & Related papers (2020-10-20T05:41:35Z)
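The MarkBERT entry above describes keeping a character-level Chinese vocabulary while inserting boundary markers between contiguous words. A minimal toy sketch of that marker-insertion step follows; the pre-segmented input and the choice of "[unused1]" as the marker token are assumptions for illustration, not details taken from the MarkBERT paper.

```python
# Toy illustration of word-boundary marker insertion for character-level Chinese input.
# MARKER and the pre-segmented sentence are assumptions, not MarkBERT's actual configuration.
MARKER = "[unused1]"  # a spare token from the BERT vocabulary repurposed as the boundary marker

def insert_boundary_markers(segmented_words):
    """Split each word into characters and interleave a boundary marker
    between adjacent words, keeping the vocabulary at the character level."""
    tokens = []
    for i, word in enumerate(segmented_words):
        tokens.extend(list(word))                 # characters of the current word
        if i < len(segmented_words) - 1:
            tokens.append(MARKER)                 # marker between contiguous words
    return tokens

# Pre-segmented sentence: "Natural language processing is interesting"
print(insert_boundary_markers(["自然", "语言", "处理", "很", "有趣"]))
# ['自', '然', '[unused1]', '语', '言', '[unused1]', '处', '理', '[unused1]', '很', '[unused1]', '有', '趣']
```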
This list is automatically generated from the titles and abstracts of the papers on this site.