Informational Space of Meaning for Scientific Texts
- URL: http://arxiv.org/abs/2004.13717v1
- Date: Tue, 28 Apr 2020 14:26:12 GMT
- Title: Informational Space of Meaning for Scientific Texts
- Authors: Neslihan Suzen, Evgeny M. Mirkes, Alexander N. Gorban
- Abstract summary: We introduce the Meaning Space, in which the meaning of a word is represented by a vector of Relative Information Gain (RIG) about the subject categories that the text belongs to.
This new approach is applied to construct the Meaning Space based on the Leicester Scientific Corpus (LSC) and the Leicester Scientific Dictionary-Core (LScDC).
The most informative words are presented for 252 categories. The proposed RIG-based model is shown to highlight topic-specific words within categories.
- Score: 68.8204255655161
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Natural Language Processing, automatically extracting the meaning
of texts is an important problem. Our focus is the computational analysis of the
meaning of short scientific texts (abstracts or brief reports). In this paper,
a vector space model is developed for quantifying the meaning of words and
texts. We introduce the Meaning Space, in which the meaning of a word is
represented by a vector of Relative Information Gain (RIG) about the subject
categories that the text belongs to, which can be obtained from observing the
word in the text. This new approach is applied to construct the Meaning Space
based on the Leicester Scientific Corpus (LSC) and the Leicester Scientific
Dictionary-Core (LScDC). The LSC is a scientific corpus of 1,673,350 abstracts,
and the LScDC is a scientific dictionary whose words are extracted from the
LSC. Each text in the LSC belongs to at least one of 252 subject categories of
Web of Science (WoS). These categories are used in the construction of vectors of
information gains. The Meaning Space is described and statistically analysed
for the LSC with the LScDC. The usefulness of the proposed representation model
is evaluated through the top-ranked words in each category, with the n most
informative words ranked per category. We demonstrate that RIG-based word
ranking is much more useful than ranking by raw word frequency for determining
the science-specific meaning and importance of a word. The proposed RIG-based
model is shown to highlight topic-specific words within categories.
The most informative words are presented for 252 categories. The new scientific
dictionary and the 103,998 x 252 Word-Category RIG Matrix are available online.
Analysis of the Meaning Space provides us with a tool to further explore
quantifying the meaning of a text using more complex and context-dependent
meaning models that use co-occurrence of words and their combinations.
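The core quantity above is the Relative Information Gain (RIG) of a word about a subject category. Below is a minimal sketch of that computation, assuming (as the abstract suggests) that both category membership and word occurrence are treated as binary per text, and that RIG is the information gain about the category normalised by the category's entropy; the function names and toy data are illustrative, not taken from the authors' released materials.

```python
import numpy as np

def entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli variable with P(success) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def relative_information_gain(in_category, contains_word) -> float:
    """RIG of one word about one subject category over a corpus.

    in_category[i]   -- True if text i belongs to the category
    contains_word[i] -- True if text i contains the word
    Returns (H(C) - H(C|W)) / H(C): the fraction of the category's
    entropy removed by observing whether the word occurs.
    """
    c = np.asarray(in_category, dtype=bool)
    w = np.asarray(contains_word, dtype=bool)

    h_c = entropy(c.mean())
    if h_c == 0.0:  # category is constant: no entropy to remove
        return 0.0

    h_c_given_w = 0.0
    for value in (True, False):
        mask = (w == value)
        p_w = mask.mean()
        if p_w > 0.0:
            h_c_given_w += p_w * entropy(c[mask].mean())

    return (h_c - h_c_given_w) / h_c

# Toy corpus of six texts: the word occurs only in category members,
# so it carries substantial information about the category.
in_cat   = [1, 1, 1, 0, 0, 0]
has_word = [1, 1, 0, 0, 0, 0]
print(relative_information_gain(in_cat, has_word))  # ~0.459
```

On this toy corpus, observing the word removes roughly 46% of the category's entropy; computing this quantity for every LScDC word against each of the 252 WoS categories would yield a Word-Category RIG Matrix of the kind the paper publishes.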
Related papers
- Tsetlin Machine Embedding: Representing Words Using Logical Expressions [10.825099126920028]
We introduce a Tsetlin Machine-based autoencoder that learns logical clauses in a self-supervised manner.
The clauses consist of contextual words like "black," "cup," and "hot" to define other words like "coffee"
We evaluate our embedding approach on several intrinsic and extrinsic benchmarks, outperforming GloVe on six classification tasks.
arXiv Detail & Related papers (2023-01-02T15:02:45Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ a feature based on the element-wise Manhattan distance vector, which can identify the semantic entailment relationship between a text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
- An Informational Space Based Semantic Analysis for Scientific Texts [62.997667081978825]
This paper introduces computational methods for semantic analysis and for quantifying the meaning of short scientific texts.
The representation of science-specific meaning is standardised by replacing psychological properties with situation representations.
This research lays the groundwork for the geometric representation of the meaning of texts.
arXiv Detail & Related papers (2022-05-31T11:19:32Z)
- Semantic Analysis for Automated Evaluation of the Potential Impact of Research Articles [62.997667081978825]
This paper presents a novel method for vector representation of text meaning based on information theory.
We show how this informational semantics is used for text classification on the basis of the Leicester Scientific Corpus.
We show that an informational approach to representing the meaning of a text offers a way to effectively predict the scientific impact of research papers.
arXiv Detail & Related papers (2021-04-26T20:37:13Z)
- What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation [74.42107665213909]
Acronyms are short forms of phrases that help convey lengthy expressions concisely in documents and serve as one of the mainstays of writing.
Due to their importance, identifying acronyms and their corresponding phrases (i.e., acronym identification (AI)) and finding the correct meaning of each acronym (i.e., acronym disambiguation (AD)) are crucial for text understanding.
Despite the recent progress on this task, there are some limitations in the existing datasets which hinder further improvement.
arXiv Detail & Related papers (2020-10-28T00:12:36Z)
- Principal Components of the Meaning [58.720142291102135]
We argue that (lexical) meaning in science can be represented in a 13-dimensional Meaning Space.
This space is constructed by applying principal component analysis (singular value decomposition) to the matrix of word-category relative information gains; see the sketch after this list.
arXiv Detail & Related papers (2020-09-18T14:28:32Z)
- Detecting New Word Meanings: A Comparison of Word Embedding Models in Spanish [1.5356167668895644]
Semantic neologisms (SN) are words that acquire a new meaning while maintaining their form.
To detect SN in a semi-automatic way, we developed a system that implements a combination of several strategies.
We examine the following word embedding models: Word2Vec, Sense2Vec, and FastText.
arXiv Detail & Related papers (2020-01-12T21:54:52Z)
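As a companion to the "Principal Components of the Meaning" entry above, here is a small sketch of extracting principal components from a Word-Category RIG matrix via SVD. The random matrix stands in for the real 103,998 x 252 matrix, the 13-component cut follows that paper's claim, and column centring before the SVD is a standard PCA convention rather than a detail confirmed by the abstract.

```python
import numpy as np

# Stand-in for the real 103,998 x 252 Word-Category RIG matrix
# (random values here, purely for illustration).
rng = np.random.default_rng(0)
rig = rng.random((1000, 252))

# PCA via SVD: centre the category columns, then decompose.
rig_centred = rig - rig.mean(axis=0)
U, s, Vt = np.linalg.svd(rig_centred, full_matrices=False)

# Cumulative fraction of variance captured by the leading components;
# the related paper argues ~13 dimensions suffice for lexical meaning.
explained = s**2 / np.sum(s**2)
print(np.cumsum(explained)[:13])

# Coordinates of each word in the reduced 13-dimensional Meaning Space.
words_13d = rig_centred @ Vt[:13].T
print(words_13d.shape)  # (1000, 13)
```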