A BERT-based Dual Embedding Model for Chinese Idiom Prediction
- URL: http://arxiv.org/abs/2011.02378v1
- Date: Wed, 4 Nov 2020 16:12:39 GMT
- Title: A BERT-based Dual Embedding Model for Chinese Idiom Prediction
- Authors: Minghuan Tan and Jing Jiang
- Abstract summary: Chinese idiom prediction task is to select the correct idiom from a set of candidate idioms given a context with a blank.
We propose a BERT-based dual embedding model to encode the contextual words as well as to learn dual embeddings of the idioms.
- Score: 8.903106634925853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chinese idioms are special fixed phrases usually derived from ancient
stories, whose meanings are oftentimes highly idiomatic and non-compositional.
The Chinese idiom prediction task is to select the correct idiom from a set of
candidate idioms given a context with a blank. We propose a BERT-based dual
embedding model to encode the contextual words as well as to learn dual
embeddings of the idioms. Specifically, we first match the embedding of each
candidate idiom with the hidden representation corresponding to the blank in
the context. We then match the embedding of each candidate idiom with the
hidden representations of all the tokens in the context through context
pooling. We further propose to use two separate idiom embeddings for the two
kinds of matching. Experiments on a recently released Chinese idiom cloze test
dataset show that our proposed method performs better than the existing state
of the art. Ablation experiments also show that both context pooling and dual
embedding contribute to the improvement of performance.
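The two matching mechanisms described in the abstract can be sketched as follows. This is a minimal NumPy illustration with synthetic vectors: the real model uses BERT hidden states and learned idiom embedding tables, and the attention-style pooling and additive score combination below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (the paper uses BERT-sized hidden states).
hidden = 8      # hidden size
seq_len = 5     # number of context tokens
n_cands = 4     # number of candidate idioms
blank_pos = 2   # index of the blank token in the context

# Hypothetical contextual encoder output: one hidden vector per token.
H = rng.normal(size=(seq_len, hidden))

# Dual embeddings: a separate table for each kind of matching.
E_blank = rng.normal(size=(n_cands, hidden))  # for blank matching
E_ctx = rng.normal(size=(n_cands, hidden))    # for context pooling

# 1) Match each candidate idiom embedding with the blank's hidden state.
score_blank = E_blank @ H[blank_pos]            # shape (n_cands,)

# 2) Context pooling: match each candidate with every context token,
#    then pool token scores with a softmax-weighted sum.
token_scores = E_ctx @ H.T                      # shape (n_cands, seq_len)
weights = np.exp(token_scores - token_scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # softmax over tokens
score_ctx = (weights * token_scores).sum(axis=1)

# Combine the two matching scores and select the best candidate.
scores = score_blank + score_ctx
pred = int(np.argmax(scores))
print(pred, scores.shape)
```

In a trained model, the candidate with the highest combined score is returned as the cloze answer; using two separate embedding tables is what the abstract calls dual embedding.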
Related papers
- What Makes an Ideal Quote? Recommending "Unexpected yet Rational" Quotations via Novelty [66.51974095399409]
We formalize quote recommendation as choosing contextually novel but semantically coherent quotations.
A generative label agent first interprets each quotation and its surrounding context into multi-dimensional deep-meaning labels.
A token-level novelty estimator then reranks candidates while mitigating auto-regressive continuation bias.
arXiv Detail & Related papers (2025-12-15T12:19:37Z) - Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use [1.5129424416840094]
Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora.
We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only 85% on Appropriateness and 40% top-1 accuracy on Open Cloze.
Chengyu-Bench demonstrates that while LLMs can reliably gauge idiom sentiment, they still struggle to grasp the cultural and contextual nuances essential for proper usage.
arXiv Detail & Related papers (2025-06-22T17:26:09Z) - Semi-Supervised Learning for Bilingual Lexicon Induction [1.8130068086063336]
We consider the problem of aligning two sets of continuous word representations, corresponding to languages, to a common space in order to infer a bilingual lexicon.
Our experiments on standard benchmarks, inferring dictionaries from English into more than 20 languages, show that our approach consistently outperforms the existing state of the art.
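A standard building block for this kind of bilingual lexicon induction is aligning the two embedding spaces with an orthogonal map fitted on a seed dictionary (orthogonal Procrustes). The sketch below uses synthetic embeddings and is not this paper's specific semi-supervised method; it only illustrates the closed-form alignment step that such systems commonly build on.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 6, 20  # embedding dimension, seed-dictionary size (illustrative)

# X: source-language embeddings, Y: target-language embeddings for a
# small seed dictionary of n translation pairs (all values synthetic;
# here Y is constructed as an exact rotation of X).
X = rng.normal(size=(n, d))
W_true = np.linalg.qr(rng.normal(size=(d, d)))[0]  # a random orthogonal map
Y = X @ W_true

# Orthogonal Procrustes: minimize ||X W - Y||_F over orthogonal W.
# The closed-form solution is W = U V^T from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# With noiseless synthetic data the rotation is recovered exactly.
err = np.linalg.norm(X @ W - Y)
print(err)
```

Once W is fitted, a bilingual lexicon is typically induced by nearest-neighbor retrieval between X @ W and the full target vocabulary.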
arXiv Detail & Related papers (2024-02-10T19:27:22Z) - That was the last straw, we need more: Are Translation Systems Sensitive
to Disambiguating Context? [64.38544995251642]
We study semantic ambiguities that exist in the source (English in this work) itself.
We focus on idioms that are open to both literal and figurative interpretations.
We find that current MT models consistently translate English idioms literally, even when the context suggests a figurative interpretation.
arXiv Detail & Related papers (2023-10-23T06:38:49Z) - Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
arXiv Detail & Related papers (2023-06-02T17:59:09Z) - Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language
Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z) - DICTDIS: Dictionary Constrained Disambiguation for Improved NMT [50.888881348723295]
We present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries.
We demonstrate the utility of DictDis via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering.
arXiv Detail & Related papers (2022-10-13T13:04:16Z) - UAlberta at SemEval 2022 Task 2: Leveraging Glosses and Translations for
Multilingual Idiomaticity Detection [4.66831886752751]
We describe the University of Alberta systems for the SemEval-2022 Task 2 on multilingual idiomaticity detection.
Under the assumption that idiomatic expressions are noncompositional, our first method integrates information on the meanings of the individual words of an expression into a binary classifier.
Our second method translates an expression in context, and uses a lexical knowledge base to determine if the translation is literal.
arXiv Detail & Related papers (2022-05-27T16:35:00Z) - Chinese Idiom Paraphrasing [33.585450600066395]
Chinese idioms are difficult for children and non-native speakers to understand.
This study proposes a novel task, denoted Chinese Idiom Paraphrasing (CIP).
CIP aims to rephrase idiomatic sentences into non-idiomatic ones while preserving the original sentence's meaning.
arXiv Detail & Related papers (2022-04-15T17:24:25Z) - LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short
Text Matching [29.318730227080675]
We introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity.
Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches.
arXiv Detail & Related papers (2021-02-25T04:01:51Z) - Synonym Knowledge Enhanced Reader for Chinese Idiom Reading
Comprehension [22.25730077173127]
Machine reading comprehension (MRC) is the task that asks a machine to answer questions based on a given context.
We first define the concept of literal meaning coverage to measure the consistency between semantics and literal meanings for Chinese idioms.
To fully utilize the synonymic relationship, we propose the synonym knowledge enhanced reader.
Experimental results on ChID, a large-scale Chinese idiom reading comprehension dataset, show that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-11-09T15:28:53Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.