A BERT-based Dual Embedding Model for Chinese Idiom Prediction
- URL: http://arxiv.org/abs/2011.02378v1
- Date: Wed, 4 Nov 2020 16:12:39 GMT
- Title: A BERT-based Dual Embedding Model for Chinese Idiom Prediction
- Authors: Minghuan Tan and Jing Jiang
- Abstract summary: Chinese idiom prediction task is to select the correct idiom from a set of candidate idioms given a context with a blank.
We propose a BERT-based dual embedding model to encode the contextual words as well as to learn dual embeddings of the idioms.
- Score: 8.903106634925853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chinese idioms are special fixed phrases usually derived from ancient
stories, whose meanings are oftentimes highly idiomatic and non-compositional.
The Chinese idiom prediction task is to select the correct idiom from a set of
candidate idioms given a context with a blank. We propose a BERT-based dual
embedding model to encode the contextual words as well as to learn dual
embeddings of the idioms. Specifically, we first match the embedding of each
candidate idiom with the hidden representation corresponding to the blank in
the context. We then match the embedding of each candidate idiom with the
hidden representations of all the tokens in the context through context
pooling. We further propose to use two separate idiom embeddings for the two
kinds of matching. Experiments on a recently released Chinese idiom cloze test
dataset show that our proposed method performs better than the existing state
of the art. Ablation experiments also show that both context pooling and dual
embedding contribute to the improvement of performance.
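The two matching mechanisms described in the abstract can be sketched as follows. This is a minimal NumPy illustration with synthetic vectors: the real model uses BERT hidden states and learned idiom embedding tables, and the attention-style pooling and additive score combination below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (the paper uses BERT-sized hidden states).
hidden = 8      # hidden size
seq_len = 5     # number of context tokens
n_cands = 4     # number of candidate idioms
blank_pos = 2   # index of the blank token in the context

# Hypothetical contextual encoder output: one hidden vector per token.
H = rng.normal(size=(seq_len, hidden))

# Dual embeddings: a separate table for each kind of matching.
E_blank = rng.normal(size=(n_cands, hidden))  # for blank matching
E_ctx = rng.normal(size=(n_cands, hidden))    # for context pooling

# 1) Match each candidate idiom embedding with the blank's hidden state.
score_blank = E_blank @ H[blank_pos]            # shape (n_cands,)

# 2) Context pooling: match each candidate with every context token,
#    then pool token scores with a softmax-weighted sum.
token_scores = E_ctx @ H.T                      # shape (n_cands, seq_len)
weights = np.exp(token_scores - token_scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # softmax over tokens
score_ctx = (weights * token_scores).sum(axis=1)

# Combine the two matching scores and select the best candidate.
scores = score_blank + score_ctx
pred = int(np.argmax(scores))
print(pred, scores.shape)
```

In a trained model, the candidate with the highest combined score is returned as the cloze answer; using two separate embedding tables is what the abstract calls dual embedding.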
Related papers
- What Makes an Ideal Quote? Recommending "Unexpected yet Rational" Quotations via Novelty [66.51974095399409]
We formalize quote recommendation as choosing contextually novel but semantically coherent quotations.
A generative label agent first interprets each quotation and its surrounding context into multi-dimensional deep-meaning labels.
A token-level novelty estimator then reranks candidates while mitigating auto-regressive continuation bias.
arXiv Detail & Related papers (2025-12-15T12:19:37Z) - Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use [1.5129424416840094]
Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora.
We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only 85% on Appropriateness and 40% top-1 accuracy on Open Cloze.
Chengyu-Bench demonstrates that while LLMs can reliably gauge idiom sentiment, they still struggle to grasp the cultural and contextual nuances essential for proper usage.
arXiv Detail & Related papers (2025-06-22T17:26:09Z) - Semi-Supervised Learning for Bilingual Lexicon Induction [1.8130068086063336]
We consider the problem of aligning two sets of continuous word representations, corresponding to languages, to a common space in order to infer a bilingual lexicon.
Our experiments on standard benchmarks, inferring dictionaries from English into more than 20 languages, show that our approach consistently outperforms the existing state of the art.
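A standard building block for this kind of bilingual lexicon induction is aligning the two embedding spaces with an orthogonal map fitted on a seed dictionary (orthogonal Procrustes). The sketch below uses synthetic embeddings and is not this paper's specific semi-supervised method; it only illustrates the closed-form alignment step that such systems commonly build on.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 6, 20  # embedding dimension, seed-dictionary size (illustrative)

# X: source-language embeddings, Y: target-language embeddings for a
# small seed dictionary of n translation pairs (all values synthetic;
# here Y is constructed as an exact rotation of X).
X = rng.normal(size=(n, d))
W_true = np.linalg.qr(rng.normal(size=(d, d)))[0]  # a random orthogonal map
Y = X @ W_true

# Orthogonal Procrustes: minimize ||X W - Y||_F over orthogonal W.
# The closed-form solution is W = U V^T from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# With noiseless synthetic data the rotation is recovered exactly.
err = np.linalg.norm(X @ W - Y)
print(err)
```

Once W is fitted, a bilingual lexicon is typically induced by nearest-neighbor retrieval between X @ W and the full target vocabulary.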
arXiv Detail & Related papers (2024-02-10T19:27:22Z) - That was the last straw, we need more: Are Translation Systems Sensitive
to Disambiguating Context? [64.38544995251642]
We study semantic ambiguities that exist in the source (English in this work) itself.
We focus on idioms that are open to both literal and figurative interpretations.
We find that current MT models consistently translate English idioms literally, even when the context suggests a figurative interpretation.
arXiv Detail & Related papers (2023-10-23T06:38:49Z) - Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
arXiv Detail & Related papers (2023-06-02T17:59:09Z) - Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language
Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z) - DICTDIS: Dictionary Constrained Disambiguation for Improved NMT [50.888881348723295]
We present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries.
We demonstrate the utility of DictDis via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering.
arXiv Detail & Related papers (2022-10-13T13:04:16Z) - UAlberta at SemEval 2022 Task 2: Leveraging Glosses and Translations for
Multilingual Idiomaticity Detection [4.66831886752751]
We describe the University of Alberta systems for the SemEval-2022 Task 2 on multilingual idiomaticity detection.
Under the assumption that idiomatic expressions are noncompositional, our first method integrates information on the meanings of the individual words of an expression into a binary classifier.
Our second method translates an expression in context, and uses a lexical knowledge base to determine if the translation is literal.
arXiv Detail & Related papers (2022-05-27T16:35:00Z) - Chinese Idiom Paraphrasing [33.585450600066395]
Chinese idioms are difficult for children and non-native speakers to understand.
This study proposes a novel task, denoted Chinese Idiom Paraphrasing (CIP).
CIP aims to rephrase idiomatic sentences into non-idiomatic ones while preserving the original sentence's meaning.
arXiv Detail & Related papers (2022-04-15T17:24:25Z) - LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short
Text Matching [29.318730227080675]
We introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity.
Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches.
arXiv Detail & Related papers (2021-02-25T04:01:51Z) - Synonym Knowledge Enhanced Reader for Chinese Idiom Reading
Comprehension [22.25730077173127]
Machine reading comprehension (MRC) is the task that asks a machine to answer questions based on a given context.
We first define the concept of literal meaning coverage to measure the consistency between semantics and literal meanings for Chinese idioms.
To fully utilize the synonymic relationship, we propose the synonym knowledge enhanced reader.
Experimental results on ChID, a large-scale Chinese idiom reading comprehension dataset, show that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-11-09T15:28:53Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.