Hybrid Improved Document-level Embedding (HIDE)
- URL: http://arxiv.org/abs/2006.01203v1
- Date: Mon, 1 Jun 2020 19:09:13 GMT
- Title: Hybrid Improved Document-level Embedding (HIDE)
- Authors: Satanik Mitra and Mamata Jenamani
- Abstract summary: We propose HIDE, a Hybrid Improved Document-level Embedding.
It incorporates domain, part-of-speech and sentiment information into existing word embeddings such as GloVe and Word2Vec.
We show considerable accuracy improvements over existing pretrained word vectors such as GloVe and Word2Vec.
- Score: 5.33024001730262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent times, word embeddings have taken on a significant role in sentiment analysis. As generating word embeddings requires huge corpora, many applications use pretrained embeddings. In spite of this success, word embeddings suffer from certain drawbacks: they do not capture the sentiment information of a word, its contextual information in terms of part-of-speech tags, or domain-specific information. In this work we propose HIDE, a Hybrid Improved Document-level Embedding, which incorporates domain information, part-of-speech information and sentiment information into existing word embeddings such as GloVe and Word2Vec. It combines the improved word embeddings into document-level embeddings. Further, Latent Semantic Analysis (LSA) is used to represent documents as vectors. HIDE is generated by combining LSA with the document-level embeddings computed from the improved word embeddings. We test HIDE on six different datasets and show considerable accuracy improvements over existing pretrained word vectors such as GloVe and Word2Vec. We further compare our work with two existing document-level sentiment analysis approaches; HIDE performs better than both.
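The abstract specifies the ingredients but not the combination rule. Below is a minimal Python sketch of the overall pipeline under stated assumptions: word vectors are simply averaged into a document vector (the sentiment/POS/domain refinement step is omitted), random vectors stand in for GloVe/Word2Vec, and concatenation is assumed as the way the LSA and embedding-based document vectors are combined.

```python
# Minimal sketch of the HIDE idea as described in the abstract: build a
# document vector from word embeddings, build an LSA vector for the same
# document, and combine the two. Averaging and concatenation are assumptions
# for illustration; the refinement of word vectors is omitted.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the movie was wonderful and moving",
    "a dull plot and terrible acting",
    "great visuals but a weak story",
]

# --- LSA document vectors: TF-IDF followed by truncated SVD ---
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
lsa_vecs = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# --- Document vectors from word embeddings (pretrained GloVe/Word2Vec in
# the paper; random vectors stand in here) ---
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in tfidf.get_feature_names_out()}

def doc_embedding(doc):
    vecs = [emb[w] for w in doc.split() if w in emb]
    return np.mean(vecs, axis=0)            # simple average, an assumption

doc_vecs = np.stack([doc_embedding(d) for d in docs])

# --- Combine: concatenation is assumed as the combination rule ---
hide_vecs = np.hstack([lsa_vecs, doc_vecs])  # shape: (n_docs, 2 + 50)
print(hide_vecs.shape)
```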
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
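A hedged sketch of the intra-batch contrastive idea: plain InfoNCE over document embeddings, where the other in-batch documents act as negatives. The paper's contribution, explicitly weighting true document neighbors inside this loss, is not reproduced here.

```python
# Plain in-batch InfoNCE over document embeddings, as a baseline sketch.
import torch
import torch.nn.functional as F

def info_nce(queries, positives, temperature=0.05):
    # Each query's positive is the same-index row; all other in-batch
    # documents serve as negatives.
    q = F.normalize(queries, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = q @ p.T / temperature          # (B, B) cosine similarities
    labels = torch.arange(q.size(0))        # diagonal = positive pairs
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```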
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that the majority of errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
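The attention mechanism at the core of this idea can be sketched generically: image region embeddings attend over document word embeddings, and the attention weights indicate which words ground which regions. The code below illustrates the mechanism only; it is not the I2DFormer architecture.

```python
# Generic cross-attention from image regions to document words.
import torch

regions = torch.randn(1, 49, 64)   # image patch embeddings   (B, R, d)
words   = torch.randn(1, 30, 64)   # document word embeddings (B, W, d)

scores = regions @ words.transpose(1, 2) / 64 ** 0.5   # (B, R, W)
attn = scores.softmax(dim=-1)      # per-region distribution over words
grounded = attn @ words            # word-informed region features
top_word = attn.argmax(dim=-1)     # most-attended word per region
```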
- Amharic Text Clustering Using Encyclopedic Knowledge with Neural Word Embedding [0.0]
We propose a system that clusters Amharic text documents using Encyclopedic Knowledge (EK) with neural word embedding.
Test results show that the use of EK with word embedding for document clustering improves the average accuracy over the use of only EK.
arXiv Detail & Related papers (2021-03-31T05:37:33Z)
- EDS-MEMBED: Multi-sense embeddings based on enhanced distributional semantic structures via a graph walk over word senses [0.0]
We leverage the rich semantic structures in WordNet to enhance the quality of multi-sense embeddings.
We derive new distributional semantic similarity measures for M-SE from prior ones.
We report evaluation results on 11 benchmark datasets involving WSD and Word Similarity tasks.
arXiv Detail & Related papers (2021-02-27T14:36:55Z)
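As a rough illustration of a graph walk over word senses, the sketch below walks WordNet's hypernym/hyponym edges from each sense of a word and collects lemma names as sense-specific context; the paper's actual construction of multi-sense embeddings is more involved.

```python
# A loose sketch of a graph walk over WordNet senses: sense-specific context
# words gathered this way could feed per-sense embeddings.
import random
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def sense_walk(synset, steps=5, seed=0):
    rng = random.Random(seed)
    context, current = [], synset
    for _ in range(steps):
        context.extend(current.lemma_names())
        neighbors = current.hypernyms() + current.hyponyms()
        if not neighbors:
            break
        current = rng.choice(neighbors)
    return context

for sense in wn.synsets("bank")[:2]:    # one walk per word sense
    print(sense.name(), "->", sense_walk(sense)[:8])
```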
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- Blind signal decomposition of various word embeddings based on join and individual variance explained [11.542392473831672]
We propose to use a novel joint signal separation method, JIVE, to jointly decompose various trained word embeddings into joint and individual components.
We conducted an empirical study on word2vec, FastText and GloVe trained on different corpora and with different dimensions.
We found that mapping different word embeddings into the joint component can greatly improve sentiment performance for embeddings whose original performance is lower.
arXiv Detail & Related papers (2020-11-30T01:36:29Z)
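JIVE itself decomposes each matrix into joint, individual and noise parts; as a heavily simplified stand-in (not the JIVE algorithm), the sketch below extracts a shared subspace from two embedding matrices over a common vocabulary and projects each embedding onto it.

```python
# Simplified stand-in for the "joint component" idea: take the top shared
# directions of two stacked embedding matrices as a joint subspace.
import numpy as np

rng = np.random.default_rng(0)
V, d1, d2, k = 1000, 300, 100, 50       # vocab size, two dims, joint rank
E1 = rng.normal(size=(V, d1))           # e.g. word2vec vectors (stand-ins)
E2 = rng.normal(size=(V, d2))           # e.g. GloVe vectors (stand-ins)

stacked = np.hstack([E1, E2])           # rows aligned by shared vocabulary
U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
joint_basis = U[:, :k]                  # shared row-space directions

# Project each embedding into the joint component; per the paper's finding,
# such a mapping helped weaker embeddings on sentiment tasks.
E1_joint = joint_basis @ (joint_basis.T @ E1)
E2_joint = joint_basis @ (joint_basis.T @ E2)
```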
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset, in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
- Document Network Projection in Pretrained Word Embedding Space [7.455546102930911]
We present Regularized Linear Embedding (RLE), a novel method that projects a collection of linked documents into a pretrained word embedding space.
We leverage a matrix of pairwise similarities providing complementary information (e.g., the network proximity of two documents in a citation graph).
The document representations can help to solve many information retrieval tasks, such as recommendation, classification and clustering.
arXiv Detail & Related papers (2020-01-16T10:16:37Z)
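A rough sketch of the projection idea: documents are placed in a pretrained word embedding space as weighted averages of word vectors, then blended with their network neighbors via a row-normalized similarity matrix. The blending rule below is an assumed form of the regularization, not the paper's exact objective.

```python
# Documents in a pretrained embedding space, pulled toward linked neighbors.
import numpy as np

rng = np.random.default_rng(0)
n_docs, vocab, dim, lam = 4, 200, 50, 0.5
W = rng.normal(size=(vocab, dim))        # pretrained word vectors (stand-ins)
counts = rng.poisson(1.0, size=(n_docs, vocab)).astype(float)

# Documents as weighted averages of their words' vectors
weights = counts / counts.sum(axis=1, keepdims=True)
D = weights @ W                          # (n_docs, dim)

# Pairwise similarities, e.g. citation-graph proximity (random stand-in)
S = rng.random((n_docs, n_docs))
np.fill_diagonal(S, 0.0)
S = S / S.sum(axis=1, keepdims=True)     # row-normalize

D_reg = (1 - lam) * D + lam * (S @ D)    # blend each doc with its neighbors
```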