GuideWalk -- Heterogeneous Data Fusion for Enhanced Learning -- A Multiclass Document Classification Case
- URL: http://arxiv.org/abs/2404.18942v1
- Date: Thu, 25 Apr 2024 18:48:11 GMT
- Title: GuideWalk -- Heterogeneous Data Fusion for Enhanced Learning -- A Multiclass Document Classification Case
- Authors: Sarmad N. Mohammed, Semra Gündüç,
- Abstract summary: A new embedding method based on the graph structure of meaningful sentences is proposed.
The success of the proposed embedding method is tested in classification problems.
The proposed method is tested with real-world data sets and eight well-known and successful embedding algorithms.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the prime problems of computer science and machine learning is to extract information efficiently from large-scale, heterogeneous data. Text data, with its syntax, semantics, and even hidden information content, possesses an exceptional place among the data types in concern. The processing of the text data requires embedding, a method of translating the content of the text to numeric vectors. A correct embedding algorithm is the starting point for obtaining the full information content of the text data. In this work, a new embedding method based on the graph structure of the meaningful sentences is proposed. The design of the algorithm aims to construct an embedding vector that constitutes syntactic and semantic elements as well as the hidden content of the text data. The success of the proposed embedding method is tested in classification problems. Among the wide range of application areas, text classification is the best laboratory for embedding methods; the classification power of the method can be tested using dimensional reduction without any further processing. Furthermore, the method can be compared with different embedding algorithms and machine learning methods. The proposed method is tested with real-world data sets and eight well-known and successful embedding algorithms. The proposed embedding method shows significantly better classification for binary and multiclass datasets compared to well-known algorithms.
Related papers
- Employing Sentence Space Embedding for Classification of Data Stream from Fake News Domain [0.24999074238880487]
This paper presents an approach to natural language data stream classification using the sentence space method.
It allows the use of convolutional deep networks dedicated to image classification to solve the task of recognizing fake news based on text data.
Based on the real-life Fakeddit dataset, the proposed approach was compared with state-of-the-art algorithms for data stream classification.
arXiv Detail & Related papers (2024-07-15T15:23:21Z) - Relation-aware Ensemble Learning for Knowledge Graph Embedding [68.94900786314666]
We propose to learn an ensemble by leveraging existing methods in a relation-aware manner.
exploring these semantics using relation-aware ensemble leads to a much larger search space than general ensemble methods.
We propose a divide-search-combine algorithm RelEns-DSC that searches the relation-wise ensemble weights independently.
arXiv Detail & Related papers (2023-10-13T07:40:12Z) - Text Classification: A Perspective of Deep Learning Methods [0.0679877553227375]
This paper introduces deep learning-based text classification algorithms, including important steps required for text classification tasks.
At the end of the article, different deep learning text classification methods are compared and summarized.
arXiv Detail & Related papers (2023-09-24T21:49:51Z) - A Novel Ehanced Move Recognition Algorithm Based on Pre-trained Models
with Positional Embeddings [6.688643243555054]
The recognition of abstracts is crucial for effectively locating the content and clarifying the article.
This paper proposes a novel enhanced move recognition algorithm with an improved pre-trained model and a gated network with attention mechanism for unstructured abstracts of Chinese scientific and technological papers.
arXiv Detail & Related papers (2023-08-14T03:20:28Z) - A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z) - A Deep Learning Anomaly Detection Method in Textual Data [0.45687771576879593]
We propose using deep learning and transformer architectures combined with classical machine learning algorithms.
We used multiple machine learning methods such as Sentence Transformers, Autos, Logistic Regression and Distance calculation methods to predict anomalies.
arXiv Detail & Related papers (2022-11-25T05:18:13Z) - LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for learning to quantify'' in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z) - Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or
Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z) - A Framework and Benchmarking Study for Counterfactual Generating Methods
on Tabular Data [0.0]
Counterfactual explanations are viewed as an effective way to explain machine learning predictions.
There are already dozens of algorithms aiming to generate such explanations.
benchmarking study and framework can help practitioners in determining which technique and building blocks most suit their context.
arXiv Detail & Related papers (2021-07-09T21:06:03Z) - Unsupervised Document Embedding via Contrastive Augmentation [48.71917352110245]
We present a contrasting learning approach with data augmentation techniques to learn document representations in unsupervised manner.
Inspired by recent contrastive self-supervised learning algorithms used for image and pretraining, we hypothesize that high-quality document embedding should be invariant to diverse paraphrases.
Our method can decrease the classification error rate by up to 6.4% over the SOTA approaches on the document classification task, matching or even surpassing fully-supervised methods.
arXiv Detail & Related papers (2021-03-26T15:48:52Z) - A Comparative Study on Structural and Semantic Properties of Sentence
Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.