Towards Automated Lexicography: Generating and Evaluating Definitions for Learner's Dictionaries
- URL: http://arxiv.org/abs/2601.01842v1
- Date: Mon, 05 Jan 2026 07:11:24 GMT
- Title: Towards Automated Lexicography: Generating and Evaluating Definitions for Learner's Dictionaries
- Authors: Yusuke Ide, Adam Nohejl, Joshua Tanner, Hitomi Yanaka, Christopher Lindsay, Taro Watanabe
- Abstract summary: We study dictionary definition generation (DDG), i.e., the generation of non-contextualized definitions for given headwords. Specifically, we address learner's dictionary definition generation (LDDG), where definitions should consist of simple words.
- Score: 37.91511820811209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study dictionary definition generation (DDG), i.e., the generation of non-contextualized definitions for given headwords. Dictionary definitions are an essential resource for learning word senses, but manually creating them is costly, which motivates us to automate the process. Specifically, we address learner's dictionary definition generation (LDDG), where definitions should consist of simple words. First, we introduce a reliable evaluation approach for DDG, based on our new evaluation criteria and powered by an LLM-as-a-judge. To provide reference definitions for the evaluation, we also construct a Japanese dataset in collaboration with a professional lexicographer. Validation results demonstrate that our evaluation approach agrees reasonably well with human annotators. Second, we propose an LDDG approach via iterative simplification with an LLM. Experimental results indicate that definitions generated by our approach achieve high scores on our criteria while maintaining lexical simplicity.
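The iterative-simplification idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name, the prompt wording, and the vocabulary-membership check are all assumptions, and `llm` stands in for any callable that maps a prompt string to a response string.

```python
def generate_learner_definition(llm, headword, allowed_vocab, max_rounds=5):
    """Sketch of LDDG via iterative simplification.

    Generate a definition, then repeatedly ask the LLM to rewrite it
    until every word falls inside the learner vocabulary (or the round
    budget is exhausted).
    """
    definition = llm(f"Define the word '{headword}' in one simple sentence.")
    for _ in range(max_rounds):
        # Collect words outside the controlled learner vocabulary.
        hard_words = [w.strip(".,;:") for w in definition.lower().split()
                      if w.strip(".,;:") not in allowed_vocab]
        if not hard_words:
            break  # definition is already lexically simple
        definition = llm(
            f"Rewrite this definition of '{headword}' using only simple "
            f"words; avoid: {', '.join(hard_words)}.\n{definition}"
        )
    return definition
```

In a real pipeline the same LLM (or a second one) would also act as the judge scoring each candidate against the paper's evaluation criteria; here only the simplification loop is shown.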
Related papers
- SciDef: Automating Definition Extraction from Academic Literature with Large Language Models [42.50759003781739]
SciDef is an LLM-based pipeline for automated definition extraction. We test SciDef on DefExtra & DefSim, novel datasets of human-extracted definitions and definition-pair similarities.
arXiv Detail & Related papers (2026-02-05T07:52:08Z)
- Unleashing the Native Recommendation Potential: LLM-Based Generative Recommendation via Structured Term Identifiers [51.64398574262054]
This paper introduces Term IDs (TIDs), defined as a set of semantically rich and standardized textual keywords, to serve as robust item identifiers. We propose GRLM, a novel framework centered on TIDs, to convert items' metadata into standardized TIDs and utilize Integrative Instruction Fine-tuning to collaboratively optimize term internalization and sequential recommendation.
arXiv Detail & Related papers (2026-01-11T07:53:20Z)
- Refining Sentence Embedding Model through Ranking Sentences Generation with Large Language Models [60.00178316095646]
Sentence embedding is essential for many NLP tasks, with contrastive learning methods achieving strong performance using datasets like NLI. Recent studies leverage large language models (LLMs) to generate sentence pairs, reducing annotation dependency. We propose a method for controlling the generation direction of LLMs in the latent space. Unlike unconstrained generation, the controlled approach ensures meaningful semantic divergence. Experiments on multiple benchmarks demonstrate that our method achieves new SOTA performance with a modest cost in ranking sentence synthesis.
arXiv Detail & Related papers (2025-02-19T12:07:53Z)
- De-jargonizing Science for Journalists with GPT-4: A Pilot Study [3.730699089967391]
The system achieves fairly high recall in identifying jargon and preserves relative differences in readers' jargon identification.
The findings highlight the potential of generative AI for assisting science reporters, and can inform future work on developing tools to simplify dense documents.
arXiv Detail & Related papers (2024-10-15T21:10:01Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Hierarchical Indexing for Retrieval-Augmented Opinion Summarization [60.5923941324953]
We propose a method for unsupervised abstractive opinion summarization that combines the attributability and scalability of extractive approaches with the coherence and fluency of Large Language Models (LLMs).
Our method, HIRO, learns an index structure that maps sentences to a path through a semantically organized discrete hierarchy.
At inference time, we populate the index and use it to identify and retrieve clusters of sentences containing popular opinions from input reviews.
arXiv Detail & Related papers (2024-03-01T10:38:07Z)
- Language Models As Semantic Indexers [78.83425357657026]
We introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model.
We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval.
arXiv Detail & Related papers (2023-10-11T18:56:15Z)
- Assisting Language Learners: Automated Trans-Lingual Definition Generation via Contrastive Prompt Learning [25.851611353632926]
The standard definition generation task requires automatically producing monolingual definitions.
We propose a novel task of Trans-Lingual Definition Generation (TLDG), which aims to generate definitions in another language.
arXiv Detail & Related papers (2023-06-09T17:32:45Z)
- SSDL: Self-Supervised Dictionary Learning [20.925371262076744]
We propose a Self-Supervised Dictionary Learning (SSDL) framework to address this challenge.
Specifically, we first design a $p$-Laplacian Attention Hypergraph Learning block as the pretext task to generate pseudo soft labels for DL.
Then, we adopt the pseudo labels to train a dictionary from a primary label-embedded DL method.
arXiv Detail & Related papers (2021-12-03T08:55:08Z)
- Toward Cross-Lingual Definition Generation for Language Learners [10.45755551957024]
We propose to generate definitions in English for words in various languages.
Models can be directly applied to other languages after being trained on the English dataset.
Experiments and manual analyses show that our models have a strong cross-lingual transfer ability.
arXiv Detail & Related papers (2020-10-12T08:45:28Z)
- VCDM: Leveraging Variational Bi-encoding and Deep Contextualized Word Representations for Improved Definition Modeling [24.775371434410328]
We tackle the task of definition modeling, where the goal is to learn to generate definitions of words and phrases.
Existing approaches for this task are discriminative, combining distributional and lexical semantics in an implicit rather than direct way.
We propose a generative model for the task, introducing a continuous latent variable to explicitly model the underlying relationship between a phrase used within a context and its definition.
arXiv Detail & Related papers (2020-10-07T02:48:44Z)
- Lexical Sememe Prediction using Dictionary Definitions by Capturing Local Semantic Correspondence [94.79912471702782]
Sememes, defined as the minimum semantic units of human languages, have been proven useful in many NLP tasks.
We propose a Sememe Correspondence Pooling (SCorP) model, which is able to capture this kind of matching to predict sememes.
We evaluate our model and baseline methods on the well-known sememe knowledge base HowNet and find that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-01-16T17:30:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.