The Analysis about Building Cross-lingual Sememe Knowledge Base Based on
Deep Clustering Network
- URL: http://arxiv.org/abs/2208.05462v1
- Date: Wed, 10 Aug 2022 17:40:45 GMT
- Title: The Analysis about Building Cross-lingual Sememe Knowledge Base Based on
Deep Clustering Network
- Authors: Xiaoran Li and Toshiaki Takano
- Abstract summary: Sememe knowledge bases (KBs) contain words annotated with sememes.
We propose an unsupervised method based on a deep clustering network (DCN) to build a sememe KB.
- Score: 0.7310043452300736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A sememe is defined as the minimum semantic unit of human languages. Sememe
knowledge bases (KBs), which contain words annotated with sememes, have been
successfully applied to many NLP tasks, and we believe that by learning the
smallest unit of meaning, computers can more easily understand human language.
However, existing sememe KBs are built through manual annotation alone; human
annotators bring personal biases of understanding, the meanings of words are
constantly updated and change with the times, and manual methods are not
always practical. To address this issue, we propose an unsupervised method
based on a deep clustering network (DCN) to build a sememe KB, and the method
can be used to build a KB for any language. We first learn distributed
representations of multilingual words, use MUSE to align them in a single
vector space, learn the multi-layer meaning of each word through a
self-attention mechanism, and use the DCN to cluster sememe features. Finally,
we perform prediction using only a 10-dimensional sememe space in English. We
find that this low-dimensional space still retains the main features of the
sememes.
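To make the described pipeline concrete, below is a minimal sketch of the clustering step, assuming MUSE-aligned multilingual word vectors are already available (the file name aligned_vectors.npy, the cluster count, and the layer sizes are illustrative placeholders, not values from the paper; the self-attention step is omitted). It illustrates the general deep-clustering-network idea of jointly training an autoencoder and k-means in its latent space, not the authors' exact implementation.

```python
# A minimal DCN-style sketch: autoencoder + k-means in the latent space.
# All names and sizes below are assumptions for illustration only.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

SEMEME_DIM = 10    # low-dimensional sememe space, as in the abstract
N_CLUSTERS = 100   # number of sememe clusters -- a hypothetical choice


class Autoencoder(nn.Module):
    """Autoencoder whose bottleneck acts as the sememe feature space."""

    def __init__(self, in_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, SEMEME_DIM),
        )
        self.decoder = nn.Sequential(
            nn.Linear(SEMEME_DIM, 128), nn.ReLU(),
            nn.Linear(128, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)


def train_dcn(x: torch.Tensor, epochs: int = 50, lam: float = 0.1):
    """Joint reconstruction + k-means loss, in the spirit of a DCN."""
    model = Autoencoder(x.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    km = None
    for _ in range(epochs):
        z, x_hat = model(x)
        # Re-fit centroids in the current latent space; centroids are
        # treated as constants within each gradient step.
        km = KMeans(n_clusters=N_CLUSTERS, n_init=4).fit(z.detach().numpy())
        centroids = torch.tensor(km.cluster_centers_, dtype=torch.float32)
        assigned = centroids[torch.from_numpy(km.labels_).long()]
        recon_loss = ((x_hat - x) ** 2).mean()       # autoencoder term
        cluster_loss = ((z - assigned) ** 2).mean()  # k-means term
        loss = recon_loss + lam * cluster_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model, km


if __name__ == "__main__":
    # Hypothetical file of MUSE-aligned word vectors, one row per word.
    vectors = torch.tensor(np.load("aligned_vectors.npy"), dtype=torch.float32)
    model, km = train_dcn(vectors)
    # km.labels_ now assigns each word a cluster id, i.e. a candidate sememe.
```

The 10-dimensional bottleneck mirrors the abstract's observation that a low-dimensional sememe space still retains the main features; each cluster in that space can then be read as a candidate sememe shared across the aligned languages.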
Related papers
- Translate Meanings, Not Just Words: IdiomKB's Role in Optimizing
Idiomatic Translation with Language Models [57.60487455727155]
Idioms, with their non-compositional nature, pose particular challenges for Transformer-based systems.
Traditional methods, which replace idioms using existing knowledge bases (KBs), often lack scale and context awareness.
We introduce a multilingual idiom KB (IdiomKB) developed using large LMs to address this.
This KB facilitates better translation by smaller models, such as BLOOMZ (7.1B), Alpaca (7B), and InstructGPT (6.7B).
arXiv Detail & Related papers (2023-08-26T21:38:31Z)
- Identifying concept libraries from language about object structure [56.83719358616503]
We leverage natural language descriptions for a diverse set of 2K procedurally generated objects to identify the parts people use.
We formalize our problem as search over a space of program libraries that contain different part concepts.
By combining naturalistic language at scale with structured program representations, we discover a fundamental information-theoretic tradeoff governing the part concepts people name.
arXiv Detail & Related papers (2022-05-11T17:49:25Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Automatic Construction of Sememe Knowledge Bases via Dictionaries [53.8700954466358]
Sememe knowledge bases (SKBs) enable sememes to be applied to natural language processing.
Most languages have no SKBs, and manual construction of SKBs is time-consuming and labor-intensive.
We propose a simple and fully automatic method of building an SKB via an existing dictionary.
arXiv Detail & Related papers (2021-05-26T14:41:01Z)
- Linguistic Classification using Instance-Based Learning [0.0]
We take a contrarian approach and question the tree-based model, which is rather restrictive.
For example, the affinity that Sanskrit independently has with languages across the Indo-European family is better illustrated using a network model.
The same can be said of the inter-relationships between languages in India, which are better discovered than assumed.
arXiv Detail & Related papers (2020-12-02T04:12:10Z)
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the challenges of low-resource settings.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that the use of NS annotators produces results that are consistently on par with or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
- LSCP: Enhanced Large Scale Colloquial Persian Language Understanding [2.7249643773851724]
"Large Scale Colloquial Persian dataset" aims to describe the colloquial language of low-resourced languages.
The proposed corpus consists of 120M sentences resulted from 27M tweets annotated with parsing tree, part-of-speech tags, sentiment polarity and translation in five different languages.
arXiv Detail & Related papers (2020-03-13T22:24:14Z)
- Lexical Sememe Prediction using Dictionary Definitions by Capturing
Local Semantic Correspondence [94.79912471702782]
Sememes, defined as the minimum semantic units of human languages, have been proven useful in many NLP tasks.
We propose a Sememe Correspondence Pooling (SCorP) model, which captures local semantic correspondence between dictionary definitions and sememes in order to predict sememes.
We evaluate our model and baseline methods on HowNet, a well-known sememe KB, and find that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-01-16T17:30:36Z)