Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use
- URL: http://arxiv.org/abs/2506.18105v1
- Date: Sun, 22 Jun 2025 17:26:09 GMT
- Title: Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use
- Authors: Yicheng Fu, Zhemin Huang, Liuxin Yang, Yumeng Lu, Zhongdongming Dai,
- Abstract summary: Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora. We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only 85% on Appropriateness and 40% top-1 accuracy on Open Cloze. Chengyu-Bench demonstrates that while LLMs can reliably gauge idiom sentiment, they still struggle to grasp the cultural and contextual nuances essential for proper usage.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chinese idioms (Chengyu) are concise four-character expressions steeped in history and culture, whose literal translations often fail to capture their full meaning. This complexity makes them challenging for language models to interpret and use correctly. Existing benchmarks focus on narrow tasks - multiple-choice cloze tests, isolated translation, or simple paraphrasing. We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks: (1) Evaluative Connotation, classifying idioms as positive or negative; (2) Appropriateness, detecting incorrect idiom usage in context; and (3) Open Cloze, filling blanks in longer passages without options. Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora. We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only ~85% on Appropriateness and ~40% top-1 accuracy on Open Cloze. Error analysis reveals that most mistakes arise from fundamental misunderstandings of idiom meanings. Chengyu-Bench demonstrates that while LLMs can reliably gauge idiom sentiment, they still struggle to grasp the cultural and contextual nuances essential for proper usage. The benchmark and source code are available at: https://github.com/sofyc/ChengyuBench.
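The abstract reports top-1 accuracy on the Open Cloze task, where a model must fill a blank in a passage with the correct idiom and no candidate list. A minimal sketch of that scoring setup is below; the example passages and the `predict()` stub are illustrative placeholders, not items from the actual dataset or code from the ChengyuBench repository.

```python
# Sketch of top-1 accuracy scoring for an open-cloze idiom task.
# The examples and predict() stub are hypothetical, for illustration only.

def top1_accuracy(predictions, references):
    """Fraction of examples whose top-ranked prediction matches the gold idiom."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical Open Cloze items: a passage with a blank and its gold idiom.
examples = [
    {"passage": "他做事一向__，从不拖泥带水。", "answer": "雷厉风行"},
    {"passage": "两人的观点__，讨论毫无进展。", "answer": "南辕北辙"},
]

def predict(passage):
    # Placeholder for an LLM call; returns the model's top-1 idiom guess.
    return "雷厉风行"

preds = [predict(ex["passage"]) for ex in examples]
golds = [ex["answer"] for ex in examples]
print(top1_accuracy(preds, golds))  # 0.5 with this stub
```

In the real benchmark the gold label would come from the human-verified dataset and `predict()` would wrap a model call; the metric itself is just exact-match over the top-ranked completion.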
Related papers
- Evaluating LLMs on Chinese Idiom Translation [12.580058582681968]
Despite recent progress in machine translation, little is known about Chinese idiom translation. We introduce a framework with a comprehensive error taxonomy for Chinese idiom translation.
arXiv Detail & Related papers (2025-08-14T07:52:56Z)
- Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity [16.065963688326242]
We study the trustworthiness of large language models (LLMs) when encountering ambiguous narrative text in Chinese. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs. We discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans.
arXiv Detail & Related papers (2025-07-30T21:50:19Z)
- SlangDIT: Benchmarking LLMs in Interpretative Slang Translation [89.48208612476068]
This paper introduces the interpretative slang translation task (named SlangDIT). It consists of three sub-tasks: slang detection, cross-lingual slang explanation, and slang translation within the current context. Based on the benchmark, we propose a deep thinking model, named SlangOWL. It first identifies whether the sentence contains slang, then judges whether the slang is polysemous and analyzes its possible meanings.
arXiv Detail & Related papers (2025-05-20T10:37:34Z)
- VideoVista-CulturalLingo: 360$^\circ$ Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension [66.03062468036507]
We present VideoVista-CulturalLingo, the first video evaluation benchmark designed to bridge cultural, linguistic, and domain divides in video comprehension. VideoVista-CulturalLingo contains 1,389 videos and 3,134 QA pairs, and we have evaluated 24 recent open-source or proprietary video large models.
arXiv Detail & Related papers (2025-04-23T13:47:30Z)
- Improving LLM Abilities in Idiomatic Translation [2.8692611791027893]
For large language models (LLMs) like NLLB and GPT, translating idioms remains a challenge. Our goal is to enhance translation fidelity by improving LLM processing of idiomatic language. This has a significant social impact, as it preserves cultural nuances and ensures translated texts retain intent and emotional resonance.
arXiv Detail & Related papers (2024-07-03T21:34:26Z)
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We find that Llama Instruct and Mistral models exhibit high degrees of language confusion. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- That was the last straw, we need more: Are Translation Systems Sensitive to Disambiguating Context? [64.38544995251642]
We study semantic ambiguities that exist in the source (English in this work) itself.
We focus on idioms that are open to both literal and figurative interpretations.
We find that current MT models consistently translate English idioms literally, even when the context suggests a figurative interpretation.
arXiv Detail & Related papers (2023-10-23T06:38:49Z)
- Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting [66.02718577386426]
We provide a simple characterization of idiomatic translation and related issues.
We conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations.
To improve translation of natural idioms, we introduce two straightforward yet effective techniques.
arXiv Detail & Related papers (2023-10-10T23:47:25Z)
- We're Afraid Language Models Aren't Modeling Ambiguity [136.8068419824318]
Managing ambiguity is a key part of human language understanding.
We characterize ambiguity in a sentence by its effect on entailment relations with another sentence.
We show that a multilabel NLI model can flag political claims in the wild that are misleading due to ambiguity.
arXiv Detail & Related papers (2023-04-27T17:57:58Z)
- PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search [25.801066428860242]
We propose PiC, a dataset of 28K noun phrases accompanied by their contextual Wikipedia pages.
We find that training on our dataset improves ranking models' accuracy and remarkably pushes Question Answering (QA) models to near-human accuracy.
arXiv Detail & Related papers (2022-07-19T04:45:41Z)
- Synonym Knowledge Enhanced Reader for Chinese Idiom Reading Comprehension [22.25730077173127]
Machine reading comprehension (MRC) is the task that asks a machine to answer questions based on a given context.
We first define the concept of literal meaning coverage to measure the consistency between semantics and literal meanings for Chinese idioms.
To fully utilize the synonymic relationship, we propose the synonym knowledge enhanced reader.
Experimental results on ChID, a large-scale Chinese idiom reading comprehension dataset, show that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-11-09T15:28:53Z)
- A BERT-based Dual Embedding Model for Chinese Idiom Prediction [8.903106634925853]
Chinese idiom prediction task is to select the correct idiom from a set of candidate idioms given a context with a blank.
We propose a BERT-based dual embedding model to encode the contextual words as well as to learn dual embeddings of the idioms.
arXiv Detail & Related papers (2020-11-04T16:12:39Z)
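The dual-embedding entry above selects the candidate idiom whose learned embedding best matches the encoded context. The toy sketch below illustrates that selection step only; the vectors and candidate set are made-up values for demonstration, not the paper's actual model or data.

```python
# Illustrative candidate-idiom selection by context/idiom embedding match.
# All vectors here are hypothetical toy values, not learned embeddings.

def dot(u, v):
    """Dot product as a simple compatibility score."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical context vector produced by an encoder for a blanked passage.
context_vec = [0.9, 0.1, 0.3]

# Hypothetical embeddings for a small candidate set of idioms.
idiom_vecs = {
    "雷厉风行": [0.8, 0.2, 0.4],
    "拖泥带水": [-0.5, 0.7, 0.1],
}

# Pick the candidate whose embedding best matches the context.
best = max(idiom_vecs, key=lambda w: dot(context_vec, idiom_vecs[w]))
print(best)  # 雷厉风行
```

In the actual model the context vector would come from BERT and the idiom embeddings would be trained jointly; the argmax-over-candidates structure is the part this sketch captures.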
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.