Amharic Text Clustering Using Encyclopedic Knowledge with Neural Word Embedding
- URL: http://arxiv.org/abs/2105.00809v2
- Date: Thu, 22 Sep 2022 14:46:31 GMT
- Title: Amharic Text Clustering Using Encyclopedic Knowledge with Neural Word Embedding
- Authors: Dessalew Yohannes and Yeregal Assabie
- Abstract summary: We propose a system that clusters Amharic text documents using Encyclopedic Knowledge (EK) with neural word embedding.
Test results show that the use of EK with word embedding for document clustering improves the average accuracy over the use of only EK.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In this digital era, people in almost every discipline use automated
systems that generate information represented in document format in different
natural languages. As a result, there is a growing interest in better
solutions for finding, organizing and analyzing these documents. In this paper,
we propose a system that clusters Amharic text documents using Encyclopedic
Knowledge (EK) with neural word embedding. EK enables the representation of
related concepts, while neural word embedding allows us to handle the contexts
of that relatedness. During the clustering process, all text documents pass
through preprocessing stages. Enriched features are extracted from each
document by mapping it against EK and the word embedding model. A TF-IDF
weighted vector of the enriched features is then generated. Finally, the text
documents are clustered using the popular spherical K-means algorithm. The
proposed system was tested with an Amharic text corpus and Amharic Wikipedia
data. Test results show that using EK with word embedding for document
clustering improves average accuracy over using EK alone. Furthermore, varying
the class size has a significant effect on accuracy.
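The pipeline above (preprocess, enrich with EK and embeddings, TF-IDF weight, cluster) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `enriched_docs` is a hypothetical placeholder for already-enriched token strings, and ordinary K-means on L2-normalized vectors is used as a common approximation of spherical K-means (which clusters by cosine similarity).

```python
# Minimal sketch of the described pipeline, assuming documents have already
# been preprocessed and enriched with EK/embedding terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Hypothetical enriched documents (romanized Amharic tokens for illustration).
enriched_docs = [
    "addis abeba ketema astedader",
    "ityopya tarik mengist",
    "kebena wenz tabiya",
]

# TF-IDF weighted vectors of the enriched features.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(enriched_docs)

# Spherical K-means clusters by cosine similarity; L2-normalizing the rows
# and running ordinary K-means is a standard approximation.
X = normalize(X, norm="l2")
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```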
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
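One plausible reading of the first method is an in-batch contrastive (InfoNCE-style) loss where neighbor documents are added as extra negatives. The sketch below is an assumption-laden illustration, not the paper's actual objective; the embeddings and temperature are toy values.

```python
# Hedged sketch: in-batch contrastive loss with neighbor documents appended
# as additional negatives. Not the paper's exact formulation.
import numpy as np

def contextual_contrastive_loss(queries, positives, neighbors, tau=0.05):
    """queries, positives: (B, d); neighbors: (M, d) extra negatives."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    cand = np.vstack([positives, neighbors])
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    logits = q @ cand.T / tau  # (B, B+M) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The i-th query's positive is the i-th candidate row.
    return -log_probs[np.arange(len(q)), np.arange(len(q))].mean()

rng = np.random.default_rng(0)
B, M, d = 4, 8, 16
loss = contextual_contrastive_loss(rng.normal(size=(B, d)),
                                   rng.normal(size=(B, d)),
                                   rng.normal(size=(M, d)))
print(loss)
```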
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- A Process for Topic Modelling Via Word Embeddings [0.0]
This work combines algorithms based on word embeddings, dimensionality reduction, and clustering.
The objective is to obtain topics from a set of unclassified texts.
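The embed-reduce-cluster recipe can be sketched as below. The random document vectors stand in for real averaged word embeddings, and PCA plus K-means is only one possible instantiation of the combination described.

```python
# Sketch of the embed -> reduce -> cluster recipe with placeholder vectors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(100, 300))  # e.g., averaged word embeddings

reduced = PCA(n_components=10).fit_transform(doc_vectors)  # dimensionality reduction
topics = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(reduced)
print(np.bincount(topics))  # documents per discovered topic
```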
arXiv Detail & Related papers (2023-10-06T15:10:35Z)
- DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [118.36746273425354]
This paper presents a paralleled visual-concept pre-training method for open-world detection by resorting to knowledge enrichment from a designed concept dictionary.
By enriching the concepts with their descriptions, we explicitly build the relationships among various concepts to facilitate the open-domain learning.
The proposed framework demonstrates strong zero-shot detection performances, e.g., on the LVIS dataset, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories.
arXiv Detail & Related papers (2022-09-20T02:01:01Z)
- Enhanced Knowledge Selection for Grounded Dialogues via Document Semantic Graphs [123.50636090341236]
We propose to automatically convert background knowledge documents into document semantic graphs.
Our document semantic graphs preserve sentence-level information through the use of sentence nodes and provide concept connections between sentences.
Our experiments show that our semantic graph-based knowledge selection improves over sentence selection baselines for both the knowledge selection task and the end-to-end response generation task on HollE.
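A toy illustration of such a graph is given below: sentence nodes are linked through shared-concept nodes, so sentences mentioning the same concept become connected. The naive keyword matching here is an assumption for brevity, not the paper's extraction pipeline.

```python
# Toy document semantic graph: sentence nodes plus concept nodes that
# connect the sentences mentioning them.
import networkx as nx

sentences = [
    "The Nile flows through Egypt.",
    "Egypt depends on the Nile for irrigation.",
]
concepts = {"nile", "egypt"}  # hypothetical extracted concepts

G = nx.Graph()
for i, sent in enumerate(sentences):
    G.add_node(("sent", i), text=sent)
    for word in sent.lower().replace(".", "").split():
        if word in concepts:
            G.add_node(("concept", word))
            G.add_edge(("sent", i), ("concept", word))

# Sentences sharing a concept are now connected through its concept node.
print(nx.shortest_path(G, ("sent", 0), ("sent", 1)))
```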
arXiv Detail & Related papers (2022-06-15T04:51:32Z)
- Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification [78.83284164605473]
Funnelling (Fun) is a recently proposed method for cross-lingual text classification.
We describe Generalized Funnelling (gFun) as a generalization of Fun.
We show that gFun substantially improves over Fun and over state-of-the-art baselines.
arXiv Detail & Related papers (2021-09-17T23:33:04Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
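A minimal sketch of collocation tokenization before LDA follows: frequent bigrams are merged into single tokens with gensim's `Phrases` before topic modelling. The corpus and thresholds are toy values chosen so the merge fires, not settings from the paper.

```python
# Merge frequent collocations into single tokens, then train LDA on the
# merged corpus (gensim).
from gensim.models import Phrases, LdaModel
from gensim.corpora import Dictionary

texts = [["new", "york", "subway"], ["new", "york", "pizza"],
         ["machine", "learning", "model"], ["machine", "learning", "data"]]

# Low thresholds so that e.g. "new york" -> "new_york" on this tiny corpus.
bigram = Phrases(texts, min_count=1, threshold=1.0)
merged = [bigram[t] for t in texts]

dictionary = Dictionary(merged)
corpus = [dictionary.doc2bow(t) for t in merged]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
print(lda.print_topics())
```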
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- FRAKE: Fusional Real-time Automatic Keyword Extraction [1.332091725929965]
Keyword extraction is the task of identifying the words or phrases that best express the main concepts of a text.
We use a combined approach consisting of two models: graph centrality features and textual features.
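The graph-centrality half of such a system can be sketched as ranking words by PageRank over a co-occurrence graph (TextRank-style). This illustrates the general technique, not FRAKE's exact feature fusion; the text and window size are toy values.

```python
# Rank candidate keywords by PageRank centrality over a word
# co-occurrence graph built with a sliding window.
import networkx as nx

text = ("keyword extraction identifies words that express the main "
        "concepts of texts keyword extraction builds graphs of words")
tokens = text.split()

G = nx.Graph()
window = 2  # co-occurrence window size (illustrative)
for i, w in enumerate(tokens):
    for j in range(i + 1, min(i + 1 + window, len(tokens))):
        G.add_edge(w, tokens[j])

scores = nx.pagerank(G)
print(sorted(scores, key=scores.get, reverse=True)[:5])  # top-5 keywords
```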
arXiv Detail & Related papers (2021-04-10T18:30:17Z)
- Hybrid Improved Document-level Embedding (HIDE) [5.33024001730262]
We propose HIDE, a Hybrid Improved Document-level Embedding.
It incorporates domain information, parts of speech information and sentiment information into existing word embeddings such as GloVe and Word2Vec.
We show considerable improvement in accuracy over existing pretrained word vectors such as GloVe and Word2Vec.
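One plausible reading of incorporating POS and sentiment information into an existing embedding is feature concatenation, sketched below; the vector sizes, tagset, and sentiment score are toy assumptions, not HIDE's actual construction.

```python
# Append part-of-speech and sentiment features to a pretrained word vector.
import numpy as np

def augmented_embedding(word_vec, pos_onehot, sentiment_score):
    """Concatenate a pretrained vector with POS and sentiment features."""
    return np.concatenate([word_vec, pos_onehot, [sentiment_score]])

glove_vec = np.random.default_rng(0).normal(size=50)  # stand-in GloVe vector
pos = np.eye(4)[1]  # hypothetical tagset: [NOUN, VERB, ADJ, ADV]
vec = augmented_embedding(glove_vec, pos, sentiment_score=0.8)
print(vec.shape)  # (50 + 4 + 1,) = (55,)
```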
arXiv Detail & Related papers (2020-06-01T19:09:13Z)
- GLEAKE: Global and Local Embedding Automatic Keyphrase Extraction [1.0681288493631977]
We introduce Global and Local Embedding Automatic Keyphrase Extractor (GLEAKE) for the task of automatic keyphrase extraction.
GLEAKE uses single and multi-word embedding techniques to explore the syntactic and semantic aspects of the candidate phrases.
It refines the most significant phrases as a final set of keyphrases.
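A common embedding-based refinement step, sketched below, scores candidate phrases by the cosine similarity of their vectors to the document vector (EmbedRank-style). The random vectors stand in for real phrase and document embeddings; this is an illustration of the general idea, not GLEAKE itself.

```python
# Rank candidate keyphrases by cosine similarity to the document embedding.
import numpy as np

rng = np.random.default_rng(0)
doc_vec = rng.normal(size=100)  # placeholder document embedding
candidates = {"word embedding": rng.normal(size=100),
              "semantic aspects": rng.normal(size=100),
              "final set": rng.normal(size=100)}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

ranked = sorted(candidates, key=lambda p: cosine(candidates[p], doc_vec),
                reverse=True)
print(ranked[:2])  # top keyphrases by similarity to the document
```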
arXiv Detail & Related papers (2020-05-19T20:24:02Z)
- Heterogeneous Graph Neural Networks for Extractive Document Summarization [101.17980994606836]
Modeling cross-sentence relations is a crucial step in extractive document summarization.
We present a graph-based neural network for extractive summarization (HeterSumGraph).
We introduce different types of nodes into graph-based neural networks for extractive document summarization.
arXiv Detail & Related papers (2020-04-26T14:38:11Z)
- Every Document Owns Its Structure: Inductive Text Classification via Graph Neural Networks [22.91359631452695]
We propose TextING for inductive text classification via Graph Neural Networks (GNNs).
We first build individual graphs for each document and then use GNN to learn the fine-grained word representations based on their local structures.
Our method outperforms state-of-the-art text classification methods.
arXiv Detail & Related papers (2020-04-22T07:23:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.