FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric
- URL: http://arxiv.org/abs/2203.08299v1
- Date: Tue, 15 Mar 2022 22:33:26 GMT
- Title: FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric
- Authors: Maximillian Chen, Caitlyn Chen, Xiao Yu, Zhou Yu
- Abstract summary: We present FastKASSIM, a metric for utterance- and document-level syntactic similarity.
It pairs and averages the most similar dependency parse trees between a pair of documents based on tree kernels.
It runs up to 5.2 times faster than our baseline method over the documents in the r/ChangeMyView corpus.
- Score: 48.66580267438049
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Syntax is a fundamental component of language, yet few metrics have been
employed to capture syntactic similarity or coherence at the utterance- and
document-level. The existing standard document-level syntactic similarity
metric is computationally expensive and performs inconsistently when faced with
syntactically dissimilar documents. To address these challenges, we present
FastKASSIM, a metric for utterance- and document-level syntactic similarity
which pairs and averages the most similar dependency parse trees between a pair
of documents based on tree kernels. FastKASSIM is more robust to syntactic
dissimilarities and differences in length, and runs up to 5.2 times faster
than our baseline method over the documents in the r/ChangeMyView corpus.
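As a rough illustration of the pairing-and-averaging step, the sketch below iterates over the shorter document's parse trees, matches each against its most similar counterpart in the other document, and averages the best-match scores. The set-of-edges tree representation and `toy_kernel` are illustrative assumptions; FastKASSIM itself computes a genuine tree kernel over dependency parses.

```python
# Minimal sketch of pairing-and-averaging, assuming the caller supplies
# a tree kernel; the toy kernel below is NOT FastKASSIM's actual kernel.

def fastkassim_similarity(doc_a_trees, doc_b_trees, tree_kernel):
    """Pair each parse tree in the shorter document with its most similar
    tree in the other document, then average those best-match scores."""
    if len(doc_a_trees) > len(doc_b_trees):
        doc_a_trees, doc_b_trees = doc_b_trees, doc_a_trees
    best = [max(tree_kernel(a, b) for b in doc_b_trees) for a in doc_a_trees]
    return sum(best) / len(best)

# Toy stand-in kernel: Jaccard overlap of (head POS, relation) edges,
# with each "tree" represented simply as a set of such edges.
def toy_kernel(tree_a, tree_b):
    return len(tree_a & tree_b) / len(tree_a | tree_b)

t1 = {("VERB", "nsubj"), ("VERB", "obj")}
t2 = {("VERB", "nsubj"), ("VERB", "obl")}
print(fastkassim_similarity([t1], [t2], toy_kernel))  # 1/3
```

Iterating over the shorter document is what makes the score robust to length differences: every tree in the shorter document gets a best match, rather than penalizing the longer document's unmatched trees.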
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
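As a rough sketch of the first idea, the snippet below implements a generic in-batch contrastive (InfoNCE) loss in which the batch would be assembled from neighboring documents, so the in-batch negatives are hard neighbors; the paper's exact objective and architecture may differ.

```python
# Generic InfoNCE over a batch assumed to contain neighboring documents;
# an approximation of the idea, not the paper's exact loss.
import torch
import torch.nn.functional as F

def neighbor_infonce(query_emb, doc_emb, temperature=0.05):
    """query_emb, doc_emb: (batch, dim); row i of doc_emb is the positive
    for row i of query_emb, and the other rows act as neighbor negatives."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                           # pairwise cosines
    labels = torch.arange(q.size(0), device=logits.device)   # diagonal = positives
    return F.cross_entropy(logits, labels)

loss = neighbor_infonce(torch.randn(8, 128), torch.randn(8, 128))
```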
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
arXiv Detail & Related papers (2024-09-23T15:02:38Z)
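A hedged sketch of the sentence-pairing idea, using the sentence-transformers library; the checkpoint and the max-then-mean aggregation below are assumptions rather than the paper's exact formulation.

```python
# Embed summary and source sentences with Sentence-BERT, score each
# summary sentence by its best-matching source sentence, then average.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def sbert_score(summary_sents, source_sents):
    summ = model.encode(summary_sents, convert_to_tensor=True)
    src = model.encode(source_sents, convert_to_tensor=True)
    sims = util.cos_sim(summ, src)  # (n_summary, n_source) cosine matrix
    return sims.max(dim=1).values.mean().item()

print(sbert_score(["The cat sat."], ["A cat was sitting.", "It rained."]))
```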
- Revisiting Code Similarity Evaluation with Abstract Syntax Tree Edit Distance [6.164970071786899]
We revisit recent code similarity evaluation metrics, particularly focusing on the application of Abstract Syntax Tree (AST) editing distance.
Our experiments showcase the effectiveness of AST editing distance in capturing intricate code structures, revealing a high correlation with established metrics.
We propose, optimize, and publish a metric that demonstrates effectiveness across all tested languages.
arXiv Detail & Related papers (2024-04-12T21:28:18Z)
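The underlying idea can be illustrated for Python with the standard-library ast module and the zss Zhang-Shasha tree edit distance package; the paper's optimized, multi-language metric may differ in its node representation and cost model.

```python
# AST edit distance for Python code: parse to ASTs, convert to zss trees
# labeled by node type, then run Zhang-Shasha tree edit distance.
import ast
from zss import Node, simple_distance

def to_zss(node):
    z = Node(type(node).__name__)
    for child in ast.iter_child_nodes(node):
        z.addkid(to_zss(child))
    return z

def ast_edit_distance(code_a, code_b):
    return simple_distance(to_zss(ast.parse(code_a)), to_zss(ast.parse(code_b)))

print(ast_edit_distance("x = 1 + 2", "x = 1 * 2"))  # small: one operator swap
```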
- Hexatagging: Projective Dependency Parsing as Tagging [63.5392760743851]
We introduce a novel dependency parser, the hexatagger, that constructs dependency trees by tagging the words in a sentence with elements from a finite set of possible tags.
Our approach is fully parallelizable at training time, i.e., the structure-building actions needed to build a dependency parse can be predicted in parallel.
We achieve state-of-the-art performance of 96.4 LAS and 97.4 UAS on the Penn Treebank test set.
arXiv Detail & Related papers (2023-06-08T18:02:07Z)
- Description-Based Text Similarity [59.552704474862004]
We identify the need to search for texts based on abstract descriptions of their content.
We propose an alternative model that significantly improves performance when used in standard nearest-neighbor search.
arXiv Detail & Related papers (2023-05-21T17:14:31Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the vocabularies of the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
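As a sketch of the merged-token preprocessing, gensim's Phrases model can stand in for the paper's collocation tokenization strategies; the toy corpus and permissive thresholds below are purely illustrative.

```python
# Merge frequent bigrams into single tokens before topic modeling.
from gensim.models.phrases import Phrases, Phraser

docs = [
    ["new", "york", "city", "subway"],
    ["new", "york", "traffic"],
    ["latent", "dirichlet", "allocation"],
    ["latent", "dirichlet", "allocation", "topics"],
]
# Permissive settings for a toy corpus; real corpora need stricter values.
bigrams = Phraser(Phrases(docs, min_count=1, threshold=1.0))
merged = [bigrams[d] for d in docs]
print(merged[0])  # ['new_york', 'city', 'subway'] -- ready for LDA training
```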
- A Comparison of Approaches to Document-level Machine Translation [34.2276281264886]
This paper presents a systematic comparison of selected approaches to document-level machine translation, evaluated on suites targeting document-level phenomena.
We find that a simple method based purely on back-translating monolingual document-level data performs as well as much more elaborate alternatives.
arXiv Detail & Related papers (2021-01-26T19:21:09Z)
- Syntactic representation learning for neural network based TTS with syntactic parse tree traversal [49.05471750563229]
We propose a syntactic representation learning method based on syntactic parse tree to automatically utilize the syntactic structure information.
Experimental results demonstrate the effectiveness of our proposed approach.
For sentences with multiple syntactic parse trees, prosodic differences can be clearly perceived from the synthesized speeches.
arXiv Detail & Related papers (2020-12-13T05:52:07Z)
- Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles [5.40541521227338]
We model the problem of finding the relationship between two documents as a pairwise document classification task.
To find semantic relations between documents, we apply a series of techniques, such as GloVe, paragraph vectors, BERT, and XLNet.
We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations.
arXiv Detail & Related papers (2020-03-22T12:52:56Z)
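A minimal sketch of the pairwise setup: represent each document pair by concatenating the two documents' feature vectors, then train a multi-class classifier over the relation labels. The random features below merely stand in for GloVe, paragraph-vector, BERT, or XLNet encodings.

```python
# Pairwise document classification over concatenated document features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(200, 64))          # stand-in embeddings, doc A
emb_b = rng.normal(size=(200, 64))          # stand-in embeddings, doc B
X = np.concatenate([emb_a, emb_b], axis=1)  # pair feature = [A ; B]
y = rng.integers(0, 4, size=200)            # one of four relation labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))
```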
- Text classification with word embedding regularization and soft similarity measure [0.20999222360659603]
Two word embedding regularization techniques were shown to reduce storage and memory costs, and to improve training speed, document processing speed, and task performance.
We show 39% average $k$NN test error reduction with regularized word embeddings compared to non-regularized word embeddings.
We also show that the soft cosine measure (SCM) with regularized word embeddings significantly outperforms the Word Mover's Distance (WMD) on text classification and is over 10,000 times faster.
arXiv Detail & Related papers (2020-03-10T22:07:34Z)
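The soft cosine measure itself is compact enough to sketch directly: an ordinary cosine between bag-of-words vectors, softened by a word-by-word similarity matrix; here the matrix comes from toy embeddings rather than the regularized embeddings the paper studies.

```python
# Soft cosine measure: cosine between bag-of-words vectors, softened by
# a word-word similarity matrix S built from (toy) word embeddings.
import numpy as np

def soft_cosine(a, b, S):
    num = a @ S @ b
    return float(num / np.sqrt((a @ S @ a) * (b @ S @ b)))

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])   # 3-word vocabulary
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
S = np.clip(unit @ unit.T, 0.0, None)                   # word similarities

a = np.array([1.0, 0.0, 0.0])  # document using word 0 only
b = np.array([0.0, 1.0, 0.0])  # document using near-synonym word 1 only
print(soft_cosine(a, b, S))    # near 1 despite zero lexical overlap
```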