OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph
- URL: http://arxiv.org/abs/2511.18622v1
- Date: Sun, 23 Nov 2025 21:33:53 GMT
- Title: OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph
- Authors: Michael J. Bommarito
- Abstract summary: OpenGloss is a synthetic encyclopedic dictionary and semantic knowledge graph for English. It integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. The entire resource was produced in under one week for under $1,000.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedias, etymology -- that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.
Related papers
- SciDef: Automating Definition Extraction from Academic Literature with Large Language Models [42.50759003781739]
SciDef is an LLM-based pipeline for automated definition extraction. We test SciDef on DefExtra & DefSim, novel datasets of human-extracted definitions and definition-pairs' similarity.
arXiv Detail & Related papers (2026-02-05T07:52:08Z)
- Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z)
- Vec2Gloss: definition modeling leveraging contextualized vectors with Wordnet gloss [8.741676279851728]
We propose a Vec2Gloss model, which produces the gloss from the target word's contextualized embeddings.
The generated glosses of this study are made possible by the systematic gloss patterns provided by Chinese Wordnet.
Our results indicate that the proposed Vec2Gloss model opens a new perspective on the lexical-semantic applications of contextualized embeddings.
arXiv Detail & Related papers (2023-05-29T02:37:37Z)
- Joint Language Semantic and Structure Embedding for Knowledge Graph Completion [66.15933600765835]
We propose to jointly embed the semantics in the natural language description of the knowledge triplets with their structure information.
Our method embeds knowledge graphs for the completion task via fine-tuning pre-trained language models.
Our experiments on a variety of knowledge graph benchmarks have demonstrated the state-of-the-art performance of our method.
arXiv Detail & Related papers (2022-09-19T02:41:02Z)
- Taxonomy Enrichment with Text and Graph Vector Representations [61.814256012166794]
We address the problem of taxonomy enrichment which aims at adding new words to the existing taxonomy.
We present a new method that achieves strong results on this task with little effort.
We achieve state-of-the-art results across different datasets and provide an in-depth error analysis.
arXiv Detail & Related papers (2022-01-21T09:01:12Z)
- Feature-rich multiplex lexical networks reveal mental strategies of early language learning [0.7111443975103329]
We introduce FEature-Rich MUltiplex LEXical (FERMULEX) networks.
Similarities model heterogeneous word associations across semantic/syntactic/phonological aspects of knowledge.
Words are enriched with multi-dimensional feature embeddings including frequency, age of acquisition, length and polysemy.
arXiv Detail & Related papers (2022-01-13T16:44:51Z)
- Computational linguistic assessment of textbook and online learning media by means of threshold concepts in business education [59.003956312175795]
From a linguistic perspective, threshold concepts are instances of specialized vocabularies, exhibiting particular linguistic features.
The profiles of 63 threshold concepts from business education have been investigated in textbooks, newspapers, and Wikipedia.
The three kinds of resources can indeed be distinguished in terms of their threshold concepts' profiles.
arXiv Detail & Related papers (2020-08-05T12:56:16Z)
- A Broad-Coverage Deep Semantic Lexicon for Verbs [3.219005794369446]
COLLIE-V is a deep lexical resource for verbs with the coverage of WordNet and semantic details that meet or exceed existing resources.
New ontological concepts and lexical entries, together with semantic role preferences and entailment axioms, are automatically derived.
arXiv Detail & Related papers (2020-07-06T12:03:14Z)
- Word Sense Disambiguation for 158 Languages using Word Embeddings Only [80.79437083582643]
Disambiguation of word senses in context is easy for humans, but a major challenge for automatic approaches.
We present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory.
We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings.
arXiv Detail & Related papers (2020-03-14T14:50:04Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
- Automatic Compilation of Resources for Academic Writing and Evaluating with Informal Word Identification and Paraphrasing System [24.42822218256954]
We present the first approach to automatically building resources for academic writing.
The aim is to build a writing aid system that automatically edits a text so that it better adheres to the academic style of writing.
arXiv Detail & Related papers (2020-03-05T22:55:45Z)