Explainable identification of similarities between entities for discovery in large text
- URL: http://arxiv.org/abs/2503.17605v1
- Date: Sat, 22 Mar 2025 01:20:43 GMT
- Title: Explainable identification of similarities between entities for discovery in large text
- Authors: Akhil Joshi, Sai Teja Erukude, Lior Shamir,
- Abstract summary: This study develops an n-gram analysis framework to compare documents automatically and uncover explainable similarities.<n>A scoring formula is applied to assigns each of the n-grams with a weight, where the weight is higher when the n-grams are more frequent in both documents.<n>Visualization tools like word clouds enhance the representation of these patterns, providing clearer insights.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the availability of virtually infinite number text documents in digital format, automatic comparison of textual data is essential for extracting meaningful insights that are difficult to identify manually. Many existing tools, including AI and large language models, struggle to provide precise and explainable insights into textual similarities. In many cases they determine the similarity between documents as reflected by the text, rather than the similarities between the subjects being discussed in these documents. This study addresses these limitations by developing an n-gram analysis framework designed to compare documents automatically and uncover explainable similarities. A scoring formula is applied to assigns each of the n-grams with a weight, where the weight is higher when the n-grams are more frequent in both documents, but is penalized when the n-grams are more frequent in the English language. Visualization tools like word clouds enhance the representation of these patterns, providing clearer insights. The findings demonstrate that this framework effectively uncovers similarities between text documents, offering explainable insights that are often difficult to identify manually. This non-parametric approach provides a deterministic solution for identifying similarities across various fields, including biographies, scientific literature, historical texts, and more. Code for the method is publicly available.
Related papers
- QUDsim: Quantifying Discourse Similarities in LLM-Generated Text [70.22275200293964]
We introduce an abstraction based on linguistic theories in Questions Under Discussion (QUD) and question semantics to help quantify differences in discourse progression.
We then use this framework to build $textbfQUDsim$, a similarity metric that can detect discursive parallels between documents.
Using QUDsim, we find that LLMs often reuse discourse structures (more so than humans) across samples, even when content differs.
arXiv Detail & Related papers (2025-04-12T23:46:09Z) - Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z) - Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$2$)
GR$2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training.
Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z) - Description-Based Text Similarity [59.552704474862004]
We identify the need to search for texts based on abstract descriptions of their content.
We propose an alternative model that significantly improves when used in standard nearest neighbor search.
arXiv Detail & Related papers (2023-05-21T17:14:31Z) - Cross-Modal Entity Matching for Visually Rich Documents [4.8119678510491815]
Visually rich documents utilize visual cues to augment their semantics.
Existing works that enable structured querying on these documents do not take this into account.
We propose Juno -- a cross-modal entity matching framework to address this limitation.
arXiv Detail & Related papers (2023-03-01T18:26:14Z) - Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - Neural Graph Matching for Modification Similarity Applied to Electronic
Document Comparison [0.0]
Document comparison is a common task in the legal and financial industries.
In this paper, we present a novel neural graph matching approach applied to document comparison.
arXiv Detail & Related papers (2022-04-12T02:37:54Z) - Combining Deep Learning and Reasoning for Address Detection in
Unstructured Text Documents [0.0]
We propose a hybrid approach that combines deep learning with reasoning for finding and extracting addresses from unstructured text documents.
We use a visual deep learning model to detect the boundaries of possible address regions on the scanned document images.
arXiv Detail & Related papers (2022-02-07T12:32:00Z) - A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z) - Neural Deepfake Detection with Factual Structure of Text [78.30080218908849]
We propose a graph-based model for deepfake detection of text.
Our approach represents the factual structure of a given document as an entity graph.
Our model can distinguish the difference in the factual structure between machine-generated text and human-written text.
arXiv Detail & Related papers (2020-10-15T02:35:31Z) - Comparative Analysis of N-gram Text Representation on Igbo Text Document
Similarity [0.0]
The improvement in Information Technology has encouraged the use of Igbo in the creation of text such as resources and news articles online.
It adopted Euclidean similarity measure to determine the similarities between Igbo text documents represented with two word-based n-gram text representation (unigram and bigram) models.
arXiv Detail & Related papers (2020-04-01T12:24:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.