Tracing the Genealogies of Ideas with Large Language Model Embeddings
- URL: http://arxiv.org/abs/2402.01661v1
- Date: Sat, 13 Jan 2024 18:42:27 GMT
- Title: Tracing the Genealogies of Ideas with Large Language Model Embeddings
- Authors: Lucian Li
- Abstract summary: I present a novel method to detect intellectual influence across a large corpus.
I apply this method to vectorize sentences from a corpus of roughly 400,000 nonfiction books and academic publications from the 19th century.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, I present a novel method to detect intellectual influence
across a large corpus. Taking advantage of the unique affordances of large
language models in encoding semantic and structural meaning while remaining
robust to paraphrasing, we can search for substantively similar ideas and hints
of intellectual influence in a computationally efficient manner. Such a method
allows us to operationalize different levels of confidence: we can allow for
direct quotation, paraphrase, or speculative similarity while remaining open
about the limitations of each threshold. I apply an ensemble method combining
General Text Embeddings, a state-of-the-art sentence embedding method optimized
to capture semantic content, with an Abstract Meaning Representation graph
representation designed to capture structural similarities in argumentation
style and the use of metaphor. I apply this method to vectorize sentences from
a corpus of roughly 400,000 nonfiction books and academic publications from the
19th century and to search them for instances of ideas and arguments appearing
in Darwin's publications. This functions as an initial evaluation and proof of
concept; the method is not limited to detecting Darwinian ideas but is capable
of detecting similarities on a large scale in a wide range of corpora and
contexts.
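The abstract's idea of operationalizing levels of confidence as similarity thresholds over sentence embeddings can be sketched as follows. The cosine measure is standard for comparing embedding vectors, but the threshold values, function names, and toy vectors here are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two sentence-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def influence_label(sim: float) -> str:
    """Map a similarity score to a confidence tier.
    The cut-offs are illustrative placeholders, not the paper's values."""
    if sim >= 0.95:
        return "direct quotation"
    if sim >= 0.85:
        return "paraphrase"
    if sim >= 0.70:
        return "speculative similarity"
    return "no detected influence"

# Toy low-dimensional vectors standing in for real sentence embeddings.
darwin = np.array([0.9, 0.1, 0.2])
candidate = np.array([0.88, 0.12, 0.25])
print(influence_label(cosine_similarity(darwin, candidate)))
```

Keeping the tiers explicit, rather than collapsing to a single cut-off, preserves the abstract's point that each threshold carries its own limitations.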
Related papers
- Conjuring Semantic Similarity [59.18714889874088]
The semantic similarity between two textual expressions measures the distance between their latent 'meaning'.
We propose a novel approach whereby the semantic similarity among textual expressions is based not on the other expressions they can be rephrased as, but on the imagery they evoke.
Our method contributes a novel perspective on semantic similarity that not only aligns with human-annotated scores, but also opens up new avenues for the evaluation of text-conditioned generative models.
arXiv Detail & Related papers (2024-10-21T18:51:34Z) - CommunityFish: A Poisson-based Document Scaling With Hierarchical Clustering [0.0]
This paper introduces CommunityFish, an augmented version of Wordfish that applies hierarchical clustering (the Louvain algorithm) to the word space, yielding communities of semantically independent n-grams emerging from the corpus.
This strategy emphasizes the interpretability of the results: since communities have a non-overlapping structure, they carry crucial informative power in discriminating parties or speakers, while also allowing faster execution of the Poisson scaling model.
arXiv Detail & Related papers (2023-08-28T19:52:18Z) - A Comparative Study of Sentence Embedding Models for Assessing Semantic Variation [0.0]
We compare several recent sentence embedding methods via time-series of semantic similarity between successive sentences and matrices of pairwise sentence similarity for multiple books of literature.
We find that most of the sentence embedding methods considered infer highly correlated patterns of semantic similarity in a given document, but also show interesting differences.
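The time-series construction this comparison relies on, similarity between each sentence and the next, can be sketched with plain NumPy. The function name and the toy embedding matrix are illustrative assumptions; real inputs would come from a sentence embedding model.

```python
import numpy as np

def successive_similarity(emb: np.ndarray) -> np.ndarray:
    """Given an (n_sentences, dim) embedding matrix, return the
    cosine similarity between each sentence and the next one."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # Row-wise dot products of consecutive unit vectors.
    return np.sum(unit[:-1] * unit[1:], axis=1)

# Toy embeddings for a 4-sentence "document"; the abrupt change at
# sentence 3 shows up as a dip in the similarity series.
emb = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [0.1, 0.9]])
series = successive_similarity(emb)
```

Plotting such a series over a whole book makes topic shifts visible as local minima, which is the kind of pattern the comparison above evaluates across embedding methods.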
arXiv Detail & Related papers (2023-08-08T23:31:10Z) - Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
arXiv Detail & Related papers (2023-03-30T12:24:25Z) - Relational Sentence Embedding for Flexible Semantic Matching [86.21393054423355]
We present Relational Sentence Embedding (RSE), a new paradigm to further discover the potential of sentence embeddings.
RSE is effective and flexible in modeling sentence relations and outperforms a series of state-of-the-art embedding methods.
arXiv Detail & Related papers (2022-12-17T05:25:17Z) - Measuring Fine-Grained Semantic Equivalence with Abstract Meaning Representation [9.666975331506812]
Identifying semantically equivalent sentences is important for many NLP tasks.
Current approaches to semantic equivalence take a loose, sentence-level approach to "equivalence".
We introduce a novel, more sensitive method of characterizing semantic equivalence that leverages Abstract Meaning Representation graph structures.
arXiv Detail & Related papers (2022-10-06T16:08:27Z) - Fine-Grained Visual Entailment [51.66881737644983]
We propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.
Unlike prior work, our method is inherently explainable and makes logical predictions at different levels of granularity.
We evaluate our method on a new dataset of manually annotated knowledge elements and show that our method achieves 68.18% accuracy at this challenging task.
arXiv Detail & Related papers (2022-03-29T16:09:38Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the resulting neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
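The anchoring step of this mask-and-predict strategy, identifying the words in the longest common subsequence of the two texts, can be sketched as follows. This shows only the alignment step; the MLM prediction over those positions requires a pretrained model and is omitted, and the example sentences are illustrative.

```python
def lcs_words(a: list[str], b: list[str]) -> list[str]:
    """Longest common subsequence of two tokenized sentences.
    These shared words serve as the 'neighboring words' whose
    positional distributions an MLM would then predict."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to recover the subsequence itself.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

s1 = "the cat sat on the mat".split()
s2 = "the cat lay on a mat".split()
print(lcs_words(s1, s2))  # → ['the', 'cat', 'on', 'mat']
```

The divergence between the predicted distributions at these shared positions is what the paper's NDD metric then measures.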
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - Multi-sense embeddings through a word sense disambiguation process [2.2344764434954256]
Most Suitable Sense Annotation (MSSA) disambiguates and annotates each word by its specific sense, considering the semantic effects of its context.
We test our approach on six different benchmarks for the word similarity task, showing that our approach can produce state-of-the-art results.
arXiv Detail & Related papers (2021-01-21T16:22:34Z) - A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z) - A computational model implementing subjectivity with the 'Room Theory'. The case of detecting Emotion from Text [68.8204255655161]
This work introduces a new method to consider subjectivity and general context dependency in text analysis.
By using a similarity measure between words, we are able to extract the relative relevance of the elements in the benchmark.
This method could be applied to all the cases where evaluating subjectivity is relevant to understand the relative value or meaning of a text.
arXiv Detail & Related papers (2020-05-12T21:26:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.