Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity
- URL: http://arxiv.org/abs/2004.00375v2
- Date: Tue, 4 Aug 2020 00:34:22 GMT
- Title: Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity
- Authors: Nkechi Ifeanyi-Reuben, Chidiebere Ugwu, Nwachukwu E.O
- Abstract summary: The improvement in Information Technology has encouraged the use of Igbo in the creation of online text resources such as news articles.
It adopts the Euclidean similarity measure to determine the similarity between Igbo text documents represented with two word-based n-gram text representation models (unigram and bigram).
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The improvement in Information Technology has encouraged the use of Igbo in
the creation of online text resources such as news articles. Text similarity is of great
importance in any text-based application. This paper presents a comparative analysis of
n-gram text representation for Igbo text document similarity. It adopts the Euclidean
similarity measure to determine the similarity between Igbo text documents represented
with two word-based n-gram models (unigram and bigram). The evaluation of the similarity
measure is based on these representation models. The system is designed with
Object-Oriented Methodology and implemented in the Python programming language with tools
from the Natural Language Toolkit (NLTK). The results show that unigram-represented text
yields the highest distance values, whereas bigram-represented text yields the lowest.
The lower the distance value, the more similar the two documents, and the better the
quality of the model when used for a task that requires a similarity measure; the
similarity of two documents increases as the distance value approaches zero. The
analysis revealed that Igbo text document similarity measured on bigram-represented
text gives the more accurate similarity result, and will therefore give better, more
effective and accurate results when used for tasks such as text classification,
clustering and ranking of Igbo text.
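The comparison the abstract describes can be sketched in plain Python: build word-based unigram and bigram term-frequency vectors for two documents and compare them with the Euclidean distance. This is an illustrative sketch, not the authors' implementation; the two sample Igbo sentences and the simple whitespace tokenizer are stand-in assumptions (the paper itself uses NLTK tooling).

```python
from collections import Counter
from math import sqrt

def ngrams(tokens, n):
    """Word-based n-grams: unigrams for n=1, bigrams for n=2."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def euclidean_distance(doc_a, doc_b, n):
    """Euclidean distance between term-frequency vectors of word n-grams."""
    freq_a = Counter(ngrams(doc_a.split(), n))
    freq_b = Counter(ngrams(doc_b.split(), n))
    vocab = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a[g] - freq_b[g]) ** 2 for g in vocab))

# Two toy Igbo sentences (hypothetical sample text, not from the paper).
doc1 = "akwukwo ozi a di mkpa"
doc2 = "akwukwo ozi a di nma"

d_uni = euclidean_distance(doc1, doc2, 1)  # distance on unigram vectors
d_bi = euclidean_distance(doc1, doc2, 2)   # distance on bigram vectors
print(d_uni, d_bi)
```

A distance of zero means identical term-frequency vectors; the paper's finding is that, over its Igbo corpus, the bigram representation produced the lower (more accurate) distance values.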
Related papers
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z)
- Description-Based Text Similarity [59.552704474862004]
We identify the need to search for texts based on abstract descriptions of their content.
We propose an alternative model that significantly improves performance when used in standard nearest-neighbor search.
arXiv Detail & Related papers (2023-05-21T17:14:31Z)
- Joint Representations of Text and Knowledge Graphs for Retrieval and Evaluation [15.55971302563369]
A key feature of neural models is that they can produce semantic vector representations of objects (texts, images, speech, etc.) ensuring that similar objects are close to each other in the vector space.
While much work has focused on learning representations for other modalities, there are no aligned cross-modal representations for text and knowledge base elements.
arXiv Detail & Related papers (2023-02-28T17:39:43Z)
- STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements.
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
- Comparing in context: Improving cosine similarity measures with a metric tensor [0.0]
Cosine similarity is a widely used measure of the relatedness of pre-trained word embeddings, trained on a language modeling goal.
We propose instead the use of an extended cosine similarity measure to improve performance on that task, with gains in interpretability.
We learn contextualized metrics and compare the results with baseline values obtained using the standard cosine similarity measure; the learned metrics consistently show improvement.
We also train a contextualized similarity measure for both SimLex-999 and WordSim-353, comparing the results with the corresponding baselines, and using these datasets as independent test sets for the all-context similarity measure learned on
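The "extended cosine similarity with a metric tensor" that this related paper describes can be sketched as the standard cosine formula with a bilinear form M inserted: sim(x, y) = x^T M y / sqrt((x^T M x)(y^T M y)). The sketch below is a minimal illustration of that form only, assuming a hypothetical diagonal M; it is not the paper's learned, contextualized metric.

```python
import numpy as np

def metric_cosine(x, y, M):
    """Cosine similarity generalized with a metric tensor M:
    sim(x, y) = x^T M y / sqrt((x^T M x) * (y^T M y))."""
    return (x @ M @ y) / np.sqrt((x @ M @ x) * (y @ M @ y))

x = np.array([1.0, 0.0, 1.0])
y = np.array([1.0, 1.0, 0.0])

I = np.eye(3)                  # identity metric recovers standard cosine
M = np.diag([2.0, 0.5, 1.0])   # hypothetical learned diagonal metric

standard = metric_cosine(x, y, I)  # = 0.5, the usual cosine similarity
weighted = metric_cosine(x, y, M)  # reweights dimensions before comparing
```

With M = I the measure reduces exactly to ordinary cosine similarity, which is why the baseline comparison in the summary above is well defined.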
arXiv Detail & Related papers (2022-03-28T18:04:26Z)
- Two-stream Hierarchical Similarity Reasoning for Image-text Matching [66.43071159630006]
A hierarchical similarity reasoning module is proposed to automatically extract context information.
Previous approaches only consider learning single-stream similarity alignment.
A two-stream architecture is developed to decompose image-text matching into image-to-text level and text-to-image level similarity computation.
arXiv Detail & Related papers (2022-03-10T12:56:10Z)
- Hierarchical Heterogeneous Graph Representation Learning for Short Text Classification [60.233529926965836]
We propose a new method called SHINE, which is based on graph neural network (GNN) for short text classification.
First, we model the short text dataset as a hierarchical heterogeneous graph consisting of word-level component graphs.
Then, we dynamically learn a short document graph that facilitates effective label propagation among similar short texts.
arXiv Detail & Related papers (2021-10-30T05:33:05Z)
- Analysis and representation of Igbo text document for a text-based system [0.0]
The interest of this paper is the Igbo language, which uses compounding as a common type of word formation and likewise has a large vocabulary of compound words.
The ambiguity in dealing with these compound words has made the representation of Igbo text documents very difficult.
This paper presents the analysis of Igbo language text document, considering its compounding nature and describes its representation with the Word-based N-gram model.
arXiv Detail & Related papers (2020-09-05T19:07:17Z)
- MultiGBS: A multi-layer graph approach to biomedical summarization [6.11737116137921]
We propose a domain-specific method that models a document as a multi-layer graph to enable multiple features of the text to be processed at the same time.
The unsupervised method selects sentences from the multi-layer graph based on the MultiRank algorithm and the number of concepts.
The proposed MultiGBS algorithm employs UMLS and extracts the concepts and relationships using different tools such as SemRep, MetaMap, and OGER.
arXiv Detail & Related papers (2020-08-27T04:22:37Z)