A Comparison of Latent Semantic Analysis and Correspondence Analysis for
Text Mining
- URL: http://arxiv.org/abs/2108.06197v1
- Date: Sun, 25 Jul 2021 09:10:10 GMT
- Authors: Qianqian Qi, David J. Hessen, Peter G. M. van der Heijden
- Abstract summary: Both latent semantic analysis (LSA) and correspondence analysis (CA) use a singular value decomposition (SVD) for dimensionality reduction.
In this article, LSA and CA are compared from a theoretical point of view and applied in both a toy example and an authorship attribution example.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Both latent semantic analysis (LSA) and correspondence analysis (CA) use a
singular value decomposition (SVD) for dimensionality reduction. In this
article, LSA and CA are compared from a theoretical point of view and applied
in both a toy example and an authorship attribution example. In text mining,
interest lies in the relationships among documents and terms: for example,
which terms are used more often in which documents. However, the LSA solution
displays a mix of marginal effects and these relationships. It appears that CA
has more attractive properties than LSA. One such property is that, in CA, the
effect of the margins is effectively eliminated, so that the CA solution is
optimally suited to focus on the relationships among documents and terms. Three
mechanisms are distinguished to weight documents and terms, and a unifying
framework is proposed that includes these three mechanisms and includes both CA
and LSA as special cases. The application of the discussed methods is
illustrated in the authorship attribution example, which concerns the national
anthem of the Netherlands.
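
The contrast drawn in the abstract between LSA and CA can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the count matrix `F` is a hypothetical toy example (not the one used in the paper), the helper names `lsa` and `ca` are made up for this sketch, LSA is applied to raw counts without any of the weighting schemes the paper discusses, and CA row coordinates use the common principal-coordinate scaling.

```python
import numpy as np

# Hypothetical toy document-term counts (rows: documents, columns: terms).
# Document 2 is document 1 doubled, and document 4 is document 3 doubled,
# so each pair has the same term profile but a different length (margin).
F = np.array([
    [2, 2, 1, 0],
    [4, 4, 2, 0],
    [0, 1, 1, 3],
    [0, 2, 2, 6],
], dtype=float)


def lsa(counts, k):
    """Rank-k LSA on raw counts: document coordinates U_k * Sigma_k."""
    U, s, _ = np.linalg.svd(counts, full_matrices=False)
    return U[:, :k] * s[:k]


def ca(counts, k):
    """Rank-k CA: SVD of standardized residuals from independence."""
    P = counts / counts.sum()          # correspondence matrix
    r = P.sum(axis=1)                  # row masses (document margins)
    c = P.sum(axis=0)                  # column masses (term margins)
    E = np.outer(r, c)                 # expected proportions under independence
    S = (P - E) / np.sqrt(E)           # margins are removed here
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of the documents (assumed scaling convention).
    return (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]


lsa_docs = lsa(F, k=2)
ca_docs = ca(F, k=2)

print("LSA document coordinates:\n", np.round(lsa_docs, 3))
print("CA document coordinates:\n", np.round(ca_docs, 3))
```

In this toy matrix the LSA coordinates of document 2 are exactly twice those of document 1, so document length (a marginal effect) shows up in the solution, whereas the CA coordinates of the two documents coincide because CA depends only on the row profiles.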
Related papers
- SparseCL: Sparse Contrastive Learning for Contradiction Retrieval [87.02936971689817]
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query.
Existing methods such as similarity search and cross-encoder models exhibit significant limitations.
We introduce SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences.
(2024-06-15T21:57:03Z)
- A Theory for Token-Level Harmonization in Retrieval-Augmented Generation [76.75124161306795]
Retrieval-augmented generation (RAG) utilizes retrieved texts to enhance large language models (LLMs).
This paper provides a theory to explain and trade off the benefit and detriment in RAG.
Based on our theory, we propose a practical novel method, Tok-RAG, which achieves collaborative generation between the pure LLM and RAG.
(2024-06-03T02:56:14Z)
- A Novel Energy based Model Mechanism for Multi-modal Aspect-Based Sentiment Analysis [85.77557381023617]
We propose a novel framework called DQPSA for multi-modal sentiment analysis.
The PDQ module uses the prompt as both a visual query and a language query to extract prompt-aware visual information.
The EPE module models the boundaries pairing of the analysis target from the perspective of an Energy-based Model.
(2023-12-13T12:00:46Z)
- CausalCite: A Causal Formulation of Paper Citations [80.82622421055734]
CausalCite is a new way to measure the significance of a paper by assessing the causal impact of the paper on its follow-up papers.
It is based on a novel causal inference method, TextMatch, which adapts the traditional matching framework to high-dimensional text embeddings.
We demonstrate the effectiveness of CausalCite on various criteria, such as high correlation with paper impact as reported by scientific experts.
(2023-11-05T23:09:39Z)
- MUSER: A Multi-View Similar Case Retrieval Dataset [65.36779942237357]
Similar case retrieval (SCR) is a representative legal AI application that plays a pivotal role in promoting judicial fairness.
Existing SCR datasets only focus on the fact description section when judging the similarity between cases.
We present MUSER, a similar case retrieval dataset based on multi-view similarity measurement and comprehensive legal element with sentence-level legal element annotations.
(2023-10-24T08:17:11Z)
- A Hierarchical Neural Framework for Classification and its Explanation in Large Unstructured Legal Documents [0.5812284760539713]
We define this problem as "scarce annotated legal documents".
We propose a deep-learning-based classification framework which we call MESc.
We also propose an explanation extraction algorithm named ORSE.
(2023-09-19T12:18:28Z)
- ConReader: Exploring Implicit Relations in Contracts for Contract Clause Extraction [84.0634340572349]
We study automatic Contract Clause Extraction (CCE) by modeling implicit relations in legal contracts.
In this work, we first comprehensively analyze the complexity issues of contracts and distill out three implicit relations commonly found in contracts.
We propose a novel framework ConReader to exploit the above three relations for better contract understanding and improving CCE.
(2022-10-17T02:15:18Z)
- A Zipf's Law-based Text Generation Approach for Addressing Imbalance in Entity Extraction [19.55959053873699]
This paper proposes a novel approach by viewing the issue through quantitative information.
It recognizes that some entities exhibit certain levels of commonality while others are scarce, which can be reflected in the quantifiable distribution of words.
Zipf's Law emerges as a well-suited choice; to transition from words to entities, words within the documents are classified as common or rare.
(2022-05-25T10:22:14Z)
- Specialized Document Embeddings for Aspect-based Similarity of Research Papers [4.661692753666685]
We treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces.
We represent a document not as a single generic embedding but as multiple specialized embeddings.
Our approach mitigates potential risks arising from implicit biases by making them explicit.
(2022-03-28T07:35:26Z)
- Hierarchical Interaction Networks with Rethinking Mechanism for Document-level Sentiment Analysis [37.20068256769269]
Document-level Sentiment Analysis (DSA) is more challenging due to vague semantic links and complicated sentiment information.
We study how to effectively generate a discriminative representation with explicit subject patterns and sentiment contexts for DSA.
We design a Sentiment-based Rethinking mechanism (SR) by refining the HIN with sentiment label information to learn a more sentiment-aware document representation.
(2020-07-16T16:27:38Z)
- A Position Aware Decay Weighted Network for Aspect based Sentiment Analysis [3.1473798197405944]
In ABSA, a text can have multiple sentiments depending upon each aspect.
Most of the existing approaches for ATSA incorporate aspect information through a different subnetwork.
In this paper, we propose a model that leverages the positional information of the aspect.
(2020-05-03T09:22:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.