Multi-Fusion Chinese WordNet (MCW): Compound of Machine Learning and
Manual Correction
- URL: http://arxiv.org/abs/2002.01761v1
- Date: Wed, 5 Feb 2020 12:44:01 GMT
- Title: Multi-Fusion Chinese WordNet (MCW): Compound of Machine Learning and
Manual Correction
- Authors: Mingchen Li and Zili Zhou and Yanna Wang
- Abstract summary: Five Chinese wordnets have been developed to address
problems of syntax and semantics: the Northeastern University Chinese WordNet
(NEW), Sinica Bilingual Ontological WordNet (BOW), Southeast University
Chinese WordNet (SEW), Taiwan University Chinese WordNet (CWN), and Chinese
Open WordNet (COW). We decided to build a new Chinese wordnet, the
Multi-Fusion Chinese WordNet (MCW), to make up for their shortcomings.
- Score: 7.471172518764192
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Princeton WordNet (PWN) is a lexical-semantic network based on
cognitive linguistics that has promoted the development of natural language
processing. Based on PWN, five Chinese wordnets have been developed to address
problems of syntax and semantics: the Northeastern University Chinese WordNet
(NEW), Sinica Bilingual Ontological WordNet (BOW), Southeast University
Chinese WordNet (SEW), Taiwan University Chinese WordNet (CWN), and Chinese
Open WordNet (COW). In using them, we found that these wordnets have low
accuracy and coverage and cannot fully reproduce the semantic network of PWN.
We therefore decided to build a new Chinese wordnet, the Multi-Fusion Chinese
WordNet (MCW), to make up for those shortcomings. The key idea is to extend
SEW with the help of the Oxford and Xinhua bilingual dictionaries and then
correct it. More specifically, our corrections combine machine learning with
manual adjustment, guided by two standards we formulated. We conducted
experiments on three tasks, relatedness calculation, word similarity, and
word sense disambiguation, to compare lemma accuracy; coverage was compared
as well. The results indicate that our method improves both the coverage and
the accuracy of MCW. It still has room for improvement, however, especially
for lemmas. In the future, we will continue to enhance the accuracy of MCW
and expand its concepts.
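
As a rough illustration of the extend-then-correct idea, here is a minimal
Python sketch that attaches Chinese lemma candidates to PWN-style synsets
through a bilingual dictionary and queues ambiguous synsets for later
correction. The synsets, dictionary entries, and review heuristic are toy
assumptions for exposition, not data or code from MCW, SEW, or the
Oxford/Xinhua dictionaries.

```python
# Illustrative sketch only: mapping English PWN-style synsets to Chinese
# lemmas through a bilingual dictionary, in the spirit of the extend-then-
# correct approach described above. All data below is a toy stand-in.

# Toy PWN-style synsets: synset id -> English lemmas.
pwn_synsets = {
    "dog.n.01": ["dog", "domestic_dog"],
    "bank.n.01": ["bank"],          # financial institution
    "bank.n.09": ["bank"],          # river bank
}

# Toy bilingual dictionary: English lemma -> candidate Chinese translations.
bilingual_dict = {
    "dog": ["狗", "犬"],
    "domestic_dog": ["家犬"],
    "bank": ["银行", "河岸"],
}

def extend_synsets(synsets, dictionary):
    """Attach every candidate translation to each synset; synsets with
    several candidates are flagged for machine-scored or manual review."""
    extended, needs_review = {}, []
    for sid, lemmas in synsets.items():
        candidates = []
        for lemma in lemmas:
            candidates.extend(dictionary.get(lemma, []))
        extended[sid] = sorted(set(candidates))
        if len(extended[sid]) > 1:
            needs_review.append(sid)  # queue for the correction stage
    return extended, needs_review

mcw_draft, review_queue = extend_synsets(pwn_synsets, bilingual_dict)
print(mcw_draft)      # e.g. {'bank.n.01': ['河岸', '银行'], ...}
print(review_queue)   # synsets whose Chinese lemmas still need correction
```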
Related papers
- Advancing the Arabic WordNet: Elevating Content Quality [8.438749883590216]
We introduce a major revision of the Arabic WordNet that addresses multiple dimensions of lexico-semantic resource quality.
We update more than 58% of the synsets of the existing Arabic WordNet by adding missing information and correcting errors.
In order to address issues of language diversity and untranslatability, we also extend the wordnet structure with new elements: phrasets and lexical gaps.
arXiv Detail & Related papers (2024-03-29T14:54:19Z)
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve text into multiple concepts for multilingual semantic matching, liberating the model from its reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- Injecting Wiktionary to improve token-level contextual representations
using contrastive learning [2.761009930426063]
We investigate how to inject a lexicon as an alternative source of supervision, using the English Wiktionary.
We also test how dimensionality reduction impacts the resulting contextual word embeddings.
arXiv Detail & Related papers (2024-02-12T17:22:42Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language
Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- Translate to Disambiguate: Zero-shot Multilingual Word Sense
Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z)
- Character, Word, or Both? Revisiting the Segmentation Granularity for
Chinese Pre-trained Language Models [42.75756994523378]
We propose a mixed-granularity Chinese BERT (MigBERT) that considers both characters and words.
We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT.
MigBERT achieves new SOTA performance on all these tasks.
arXiv Detail & Related papers (2023-03-20T06:20:03Z)
- Automatically constructing Wordnet synsets [2.363388546004777]
We propose approaches to generate Wordnet synsets for languages both resource-rich and resource-poor.
Our algorithms translate synsets of existing Wordnets to a target language T, then apply a ranking method to the translation candidates to find the best translations in T.
arXiv Detail & Related papers (2022-08-08T02:02:18Z)
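
The translate-then-rank recipe in the entry above lends itself to a compact
sketch. The Python toy below uses a made-up dictionary and a simple voting
heuristic as a stand-in for the paper's actual ranking method; it only shows
the shape of the approach.

```python
# Minimal sketch of translate-then-rank synset construction: translate the
# lemmas of a source synset into the target language T with a bilingual
# dictionary, then rank candidates by how many source lemmas proposed them.
# The dictionary and heuristic are illustrative assumptions.
from collections import Counter

def rank_translations(source_lemmas, dictionary):
    votes = Counter()
    for lemma in source_lemmas:
        for candidate in dictionary.get(lemma, []):
            votes[candidate] += 1
    # Candidates proposed by more source lemmas rank higher.
    return [cand for cand, _ in votes.most_common()]

toy_dict = {"car": ["汽车", "车"], "automobile": ["汽车"], "auto": ["汽车"]}
print(rank_translations(["car", "automobile", "auto"], toy_dict))
# ['汽车', '车'] -- '汽车' wins because all three source lemmas propose it
```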
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model that consults an online web dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Towards Automatic Construction of Filipino WordNet: Word Sense Induction
and Synset Induction Using Sentence Embeddings [0.7214142393172727]
This study proposes a method for word sense induction and synset induction using only two linguistic resources.
The resulting sense inventory and synonym sets can be used in automatically creating a wordnet.
This study empirically shows that 30% of the induced word senses are valid and 40% of the induced synsets are valid, of which 20% are novel synsets.
arXiv Detail & Related papers (2022-04-07T06:50:37Z)
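
As a loose illustration of the sense-induction step in the entry above, the
sketch below clusters context embeddings of an ambiguous word and treats each
cluster as one induced sense. The random vectors and the fixed cluster count
are placeholder assumptions, not the study's actual pipeline or data.

```python
# Rough sketch of embedding-based word sense induction: cluster the contexts
# in which a target word occurs, then read each cluster as an induced sense.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Pretend these are sentence embeddings for 6 usages of an ambiguous word;
# a real pipeline would compute them with a sentence encoder.
context_embeddings = rng.normal(size=(6, 32))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
sense_ids = kmeans.fit_predict(context_embeddings)
print(sense_ids)  # e.g. [0 1 0 0 1 1]: usages grouped into induced senses
```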
- "Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction [58.40808660657153]
We investigate whether whole word masking (WWM) leads to better context understanding ability for Chinese BERT.
We construct a dataset including labels for 19,075 tokens in 10,448 sentences.
We train three Chinese BERT models with standard character-level masking (CLM), WWM, and a combination of CLM and WWM, respectively.
arXiv Detail & Related papers (2022-03-01T08:24:56Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of the meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
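
The entropy-based operationalisation of ambiguity summarized in the entry
above can be stated in a few lines. The sense probabilities below are
invented for illustration; the paper estimates meaning distributions from
data rather than assuming them.

```python
# Toy illustration of lexical ambiguity as the Shannon entropy of a word's
# meaning distribution: H = -sum(p * log2(p)) over the word's senses.
import math

def sense_entropy(probs):
    """Entropy (in bits) of a word's sense probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(sense_entropy([0.5, 0.5]))        # 1.0 bit: two equally likely senses
print(sense_entropy([0.9, 0.05, 0.05])) # ~0.57 bits: one dominant sense
```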
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.