Analysis and representation of Igbo text document for a text-based
system
- URL: http://arxiv.org/abs/2009.06376v1
- Date: Sat, 5 Sep 2020 19:07:17 GMT
- Title: Analysis and representation of Igbo text document for a text-based
system
- Authors: Ifeanyi-Reuben Nkechi J., Ugwu Chidiebere, Adegbola Tunde
- Abstract summary: The interest of this paper is the Igbo language, which uses compounding as a common type of word formation and has a large vocabulary of compound words.
The ambiguity in dealing with these compound words has made the representation of Igbo text documents very difficult.
This paper presents an analysis of Igbo text documents, considering the language's compounding nature, and describes their representation with a word-based N-gram model.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advancement in Information Technology (IT) has assisted in incorporating
the three major Nigerian languages into text-based applications such as text
mining, information retrieval and natural language processing. The interest of
this paper is the Igbo language, which uses compounding as a common type of
word formation and has a large vocabulary of compound words. Collocation,
word ordering and compounding play a major role in the Igbo language. The
ambiguity in dealing with these compound words has made the representation of
Igbo text documents very difficult, because it cannot be addressed with the
most common and standard approach, the Bag-Of-Words (BOW) model of text
representation, which ignores word order and word relations. This is a cause
for concern and creates the need for an improved model that captures this
situation. This paper presents an analysis of Igbo text documents, considering
the language's compounding nature, and describes their representation with a
word-based N-gram model to properly prepare them for any text-based application.
The results show that the bigram and trigram text representation models provide
more semantic information and also address the issues of compounding, word
ordering and collocation, which are the major peculiarities of the Igbo
language. They are therefore likely to give better performance when used in any
Igbo text-based system.
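As a concrete illustration of the representation discussed above, the following minimal Python sketch contrasts a Bag-of-Words count with word-based bigram and trigram counts for a short Igbo phrase. It is an illustrative sketch, not the authors' implementation; the sample phrase, the whitespace tokenisation and the helper name word_ngrams are assumptions made here.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return word-based n-grams (as tuples) from an ordered token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative Igbo phrase containing the compound "ụlọ akwụkwọ" ("school",
# literally "house of book"); the phrase is an assumed example, not data
# from the paper.
tokens = "ụlọ akwụkwọ dị mma".split()

bow = Counter(tokens)                       # Bag-of-Words: word order lost, compound split apart
bigrams = Counter(word_ngrams(tokens, 2))   # keeps ("ụlọ", "akwụkwọ") together as one feature
trigrams = Counter(word_ngrams(tokens, 3))  # also captures longer collocations

print(bow)       # unigram counts only: ụlọ, akwụkwọ, dị, mma counted separately
print(bigrams)   # includes the compound ("ụlọ", "akwụkwọ") as a single unit
print(trigrams)
```

The bigram and trigram features retain the adjacency information that signals compounds and collocations, which is exactly what the BOW representation discards.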
Related papers
- BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla
Lemmatizer [3.1742013359102175]
We propose linguistic rules for lemmatization and utilize a dictionary along with the rules to design a lemmatizer for Bangla.
Our system aims to lemmatize words based on their part-of-speech class within a given sentence.
The lemmatizer achieves an accuracy of 96.36% when tested against a manually annotated test dataset.
arXiv Detail & Related papers (2023-11-06T13:02:07Z) - An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP.
We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z) - Word Order Does Matter (And Shuffled Language Models Know It) [9.990431777927421]
Recent studies have shown that language models pretrained and/or fine-tuned on randomly permuted sentences exhibit competitive performance on GLUE.
We investigate what position embeddings learned from shuffled text encode, showing that these models retain information pertaining to the original, naturalistic word order.
arXiv Detail & Related papers (2022-03-21T14:10:15Z) - A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z) - VLGrammar: Grounded Grammar Induction of Vision and Language [86.88273769411428]
We study grounded grammar induction of vision and language in a joint learning framework.
We present VLGrammar, a method that uses compound probabilistic context-free grammars (compound PCFGs) to induce the language grammar and the image grammar simultaneously.
arXiv Detail & Related papers (2021-03-24T04:05:08Z) - Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z) - Improving Machine Reading Comprehension with Contextualized Commonsense
Knowledge [62.46091695615262]
We aim to extract commonsense knowledge to improve machine reading comprehension.
We propose to represent relations implicitly by situating structured knowledge in a context.
We employ a teacher-student paradigm to inject multiple types of contextualized knowledge into a student machine reader.
arXiv Detail & Related papers (2020-09-12T17:20:01Z) - Probing Contextual Language Models for Common Ground with Visual
Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z) - Comparative Analysis of N-gram Text Representation on Igbo Text Document
Similarity [0.0]
The improvement in Information Technology has encouraged the use of Igbo in the creation of online text such as resources and news articles.
The paper adopts the Euclidean similarity measure to determine the similarity between Igbo text documents represented with two word-based n-gram text representation models (unigram and bigram); a minimal sketch of such a comparison appears after this list.
arXiv Detail & Related papers (2020-04-01T12:24:47Z) - A Survey on Contextual Embeddings [48.04732268018772]
Contextual embeddings assign each word a representation based on its context, capturing uses of words across varied contexts and encoding knowledge that transfers across languages.
We review existing contextual embedding models, cross-lingual polyglot pre-training, the application of contextual embeddings in downstream tasks, model compression, and model analyses.
arXiv Detail & Related papers (2020-03-16T15:22:22Z)
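For the Comparative Analysis entry above, which compares Igbo documents represented with unigram and bigram models using a Euclidean measure, the following is a minimal sketch of one plausible way to compute such a comparison, assuming raw n-gram counts over a shared vocabulary. The helper names and sample phrases are illustrative assumptions, not taken from that paper.

```python
import math
from collections import Counter

def word_ngrams(tokens, n):
    """Return word-based n-grams (as tuples) from an ordered token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def euclidean_distance(doc_a, doc_b, n=2):
    """Euclidean distance between raw n-gram count vectors of two token lists;
    a smaller distance is read as higher similarity."""
    counts_a = Counter(word_ngrams(doc_a, n))
    counts_b = Counter(word_ngrams(doc_b, n))
    vocab = set(counts_a) | set(counts_b)   # shared n-gram vocabulary
    return math.sqrt(sum((counts_a[g] - counts_b[g]) ** 2 for g in vocab))

# Illustrative Igbo phrases (assumed examples, not from the paper).
doc1 = "ụlọ akwụkwọ dị mma".split()
doc2 = "ụlọ akwụkwọ ahụ dị mma".split()

print(euclidean_distance(doc1, doc2, n=1))  # unigram-based comparison
print(euclidean_distance(doc1, doc2, n=2))  # bigram-based comparison
```

In practice the count vectors could be normalised or TF-IDF weighted before the distance is taken; the raw-count form above is only the simplest variant.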
This list is automatically generated from the titles and abstracts of the papers in this site.