FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric
- URL: http://arxiv.org/abs/2203.08299v1
- Date: Tue, 15 Mar 2022 22:33:26 GMT
- Title: FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric
- Authors: Maximillian Chen, Caitlyn Chen, Xiao Yu, Zhou Yu
- Abstract summary: We present FastKASSIM, a metric for utterance- and document-level syntactic similarity.
It pairs and averages the most similar dependency parse trees between a pair of documents based on tree kernels.
It runs up to 5.2 times faster than our baseline method over the documents in the r/ChangeMyView corpus.
- Score: 48.66580267438049
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Syntax is a fundamental component of language, yet few metrics have been
employed to capture syntactic similarity or coherence at the utterance- and
document-level. The existing standard document-level syntactic similarity
metric is computationally expensive and performs inconsistently when faced with
syntactically dissimilar documents. To address these challenges, we present
FastKASSIM, a metric for utterance- and document-level syntactic similarity
which pairs and averages the most similar dependency parse trees between a pair
of documents based on tree kernels. FastKASSIM is more robust to syntactic
dissimilarities and differences in length, and runs up to 5.2 times faster
than our baseline method over the documents in the r/ChangeMyView corpus.
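As a rough illustration of the pairing-and-averaging step, the sketch below iterates over the shorter document's parse trees, matches each against its most similar counterpart in the other document, and averages the best-match scores. The set-of-edges tree representation and `toy_kernel` are illustrative assumptions; FastKASSIM itself computes a genuine tree kernel over dependency parses.

```python
# Minimal sketch of pairing-and-averaging, assuming the caller supplies
# a tree kernel; the toy kernel below is NOT FastKASSIM's actual kernel.

def fastkassim_similarity(doc_a_trees, doc_b_trees, tree_kernel):
    """Pair each parse tree in the shorter document with its most similar
    tree in the other document, then average those best-match scores."""
    if len(doc_a_trees) > len(doc_b_trees):
        doc_a_trees, doc_b_trees = doc_b_trees, doc_a_trees
    best = [max(tree_kernel(a, b) for b in doc_b_trees) for a in doc_a_trees]
    return sum(best) / len(best)

# Toy stand-in kernel: Jaccard overlap of (head POS, relation) edges,
# with each "tree" represented simply as a set of such edges.
def toy_kernel(tree_a, tree_b):
    return len(tree_a & tree_b) / len(tree_a | tree_b)

t1 = {("VERB", "nsubj"), ("VERB", "obj")}
t2 = {("VERB", "nsubj"), ("VERB", "obl")}
print(fastkassim_similarity([t1], [t2], toy_kernel))  # 1/3
```

Iterating over the shorter document is what makes the score robust to length differences: every tree in the shorter document gets a best match, rather than penalizing the longer document's unmatched trees.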
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
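As a rough sketch of the first idea, the snippet below implements a generic in-batch contrastive (InfoNCE) loss in which the batch would be assembled from neighboring documents, so the in-batch negatives are hard neighbors; the paper's exact objective and architecture may differ.

```python
# Generic InfoNCE over a batch assumed to contain neighboring documents;
# an approximation of the idea, not the paper's exact loss.
import torch
import torch.nn.functional as F

def neighbor_infonce(query_emb, doc_emb, temperature=0.05):
    """query_emb, doc_emb: (batch, dim); row i of doc_emb is the positive
    for row i of query_emb, and the other rows act as neighbor negatives."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                           # pairwise cosines
    labels = torch.arange(q.size(0), device=logits.device)   # diagonal = positives
    return F.cross_entropy(logits, labels)

loss = neighbor_infonce(torch.randn(8, 128), torch.randn(8, 128))
```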
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
arXiv Detail & Related papers (2024-09-23T15:02:38Z)
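A hedged sketch of the sentence-pairing idea, using the sentence-transformers library; the checkpoint and the max-then-mean aggregation below are assumptions rather than the paper's exact formulation.

```python
# Embed summary and source sentences with Sentence-BERT, score each
# summary sentence by its best-matching source sentence, then average.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def sbert_score(summary_sents, source_sents):
    summ = model.encode(summary_sents, convert_to_tensor=True)
    src = model.encode(source_sents, convert_to_tensor=True)
    sims = util.cos_sim(summ, src)  # (n_summary, n_source) cosine matrix
    return sims.max(dim=1).values.mean().item()

print(sbert_score(["The cat sat."], ["A cat was sitting.", "It rained."]))
```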
- Revisiting Code Similarity Evaluation with Abstract Syntax Tree Edit Distance [6.164970071786899]
We revisit recent code similarity evaluation metrics, particularly focusing on the application of Abstract Syntax Tree (AST) editing distance.
Our experiments showcase the effectiveness of AST editing distance in capturing intricate code structures, revealing a high correlation with established metrics.
We propose, optimize, and publish a metric that demonstrates effectiveness across all tested languages.
arXiv Detail & Related papers (2024-04-12T21:28:18Z)
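The underlying idea can be illustrated for Python with the standard-library ast module and the zss Zhang-Shasha tree edit distance package; the paper's optimized, multi-language metric may differ in its node representation and cost model.

```python
# AST edit distance for Python code: parse to ASTs, convert to zss trees
# labeled by node type, then run Zhang-Shasha tree edit distance.
import ast
from zss import Node, simple_distance

def to_zss(node):
    z = Node(type(node).__name__)
    for child in ast.iter_child_nodes(node):
        z.addkid(to_zss(child))
    return z

def ast_edit_distance(code_a, code_b):
    return simple_distance(to_zss(ast.parse(code_a)), to_zss(ast.parse(code_b)))

print(ast_edit_distance("x = 1 + 2", "x = 1 * 2"))  # small: one operator swap
```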
- Hexatagging: Projective Dependency Parsing as Tagging [63.5392760743851]
We introduce a novel dependency parser, the hexatagger, that constructs dependency trees by tagging the words in a sentence with elements from a finite set of possible tags.
Our approach is fully parallelizable at training time, i.e., the structure-building actions needed to build a dependency parse can be predicted in parallel.
We achieve state-of-the-art performance of 96.4 LAS and 97.4 UAS on the Penn Treebank test set.
arXiv Detail & Related papers (2023-06-08T18:02:07Z)
- Description-Based Text Similarity [59.552704474862004]
We identify the need to search for texts based on abstract descriptions of their content.
We propose an alternative model that significantly improves performance when used in standard nearest-neighbor search.
arXiv Detail & Related papers (2023-05-21T17:14:31Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the vocabularies of the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
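As a sketch of the merged-token preprocessing, gensim's Phrases model can stand in for the paper's collocation tokenization strategies; the toy corpus and permissive thresholds below are purely illustrative.

```python
# Merge frequent bigrams into single tokens before topic modeling.
from gensim.models.phrases import Phrases, Phraser

docs = [
    ["new", "york", "city", "subway"],
    ["new", "york", "traffic"],
    ["latent", "dirichlet", "allocation"],
    ["latent", "dirichlet", "allocation", "topics"],
]
# Permissive settings for a toy corpus; real corpora need stricter values.
bigrams = Phraser(Phrases(docs, min_count=1, threshold=1.0))
merged = [bigrams[d] for d in docs]
print(merged[0])  # ['new_york', 'city', 'subway'] -- ready for LDA training
```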
- A Comparison of Approaches to Document-level Machine Translation [34.2276281264886]
This paper presents a systematic comparison of selected approaches to document-level machine translation, evaluated on suites targeting document-level phenomena.
We find that a simple method based purely on back-translating monolingual document-level data performs as well as much more elaborate alternatives.
arXiv Detail & Related papers (2021-01-26T19:21:09Z)
- Syntactic representation learning for neural network based TTS with syntactic parse tree traversal [49.05471750563229]
We propose a syntactic representation learning method based on syntactic parse tree to automatically utilize the syntactic structure information.
Experimental results demonstrate the effectiveness of our proposed approach.
For sentences with multiple syntactic parse trees, prosodic differences can be clearly perceived from the synthesized speeches.
arXiv Detail & Related papers (2020-12-13T05:52:07Z)
- Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles [5.40541521227338]
We model the problem of finding the relationship between two documents as a pairwise document classification task.
To find semantic relations between documents, we apply a series of techniques, such as GloVe, paragraph vectors, BERT, and XLNet.
We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations.
arXiv Detail & Related papers (2020-03-22T12:52:56Z)
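A minimal sketch of the pairwise setup: represent each document pair by concatenating the two documents' feature vectors, then train a multi-class classifier over the relation labels. The random features below merely stand in for GloVe, paragraph-vector, BERT, or XLNet encodings.

```python
# Pairwise document classification over concatenated document features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(200, 64))          # stand-in embeddings, doc A
emb_b = rng.normal(size=(200, 64))          # stand-in embeddings, doc B
X = np.concatenate([emb_a, emb_b], axis=1)  # pair feature = [A ; B]
y = rng.integers(0, 4, size=200)            # one of four relation labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))
```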
- Text classification with word embedding regularization and soft similarity measure [0.20999222360659603]
Two word embedding regularization techniques were shown to reduce storage and memory costs, and to improve training speed, document processing speed, and task performance.
We show 39% average $k$NN test error reduction with regularized word embeddings compared to non-regularized word embeddings.
We also show that the soft cosine measure (SCM) with regularized word embeddings significantly outperforms the Word Mover's Distance (WMD) on text classification and is over 10,000 times faster.
arXiv Detail & Related papers (2020-03-10T22:07:34Z)
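The soft cosine measure itself is compact enough to sketch directly: an ordinary cosine between bag-of-words vectors, softened by a word-by-word similarity matrix; here the matrix comes from toy embeddings rather than the regularized embeddings the paper studies.

```python
# Soft cosine measure: cosine between bag-of-words vectors, softened by
# a word-word similarity matrix S built from (toy) word embeddings.
import numpy as np

def soft_cosine(a, b, S):
    num = a @ S @ b
    return float(num / np.sqrt((a @ S @ a) * (b @ S @ b)))

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])   # 3-word vocabulary
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
S = np.clip(unit @ unit.T, 0.0, None)                   # word similarities

a = np.array([1.0, 0.0, 0.0])  # document using word 0 only
b = np.array([0.0, 1.0, 0.0])  # document using near-synonym word 1 only
print(soft_cosine(a, b, S))    # near 1 despite zero lexical overlap
```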