ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity
- URL: http://arxiv.org/abs/2509.22246v1
- Date: Fri, 26 Sep 2025 12:02:58 GMT
- Title: ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity
- Authors: Xiaoyang Liu, Tao Zhu, Zineng Dong, Yuntian Liu, Qingfeng Guo, Zhaoxuan Liu, Yu Chen, Tao Luo,
- Abstract summary: We introduce ASSESS (A Semantic and Structural Evaluation Framework for Statement Similarity), which comprehensively integrates semantic and structural information to provide a continuous similarity score. For rigorous validation, we present EPLA, a new benchmark of 524 expert-annotated formal statement pairs derived from miniF2F and ProofNet. Experiments on EPLA demonstrate that TransTED Similarity outperforms existing methods, achieving state-of-the-art accuracy and the highest Kappa coefficient.
- Score: 9.337443482551356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Statement autoformalization, the automated translation of statements from natural language into formal languages, has seen significant advancements, yet the development of automated evaluation metrics remains limited. Existing metrics for formal statement similarity often fail to balance semantic and structural information. String-based approaches capture syntactic structure but ignore semantic meaning, whereas proof-based methods validate semantic equivalence but disregard structural nuances and, critically, provide no graded similarity score in the event of proof failure. To address these issues, we introduce ASSESS (A Semantic and Structural Evaluation Framework for Statement Similarity), which comprehensively integrates semantic and structural information to provide a continuous similarity score. Our framework first transforms formal statements into Operator Trees to capture their syntactic structure and then computes a similarity score using our novel TransTED (Transformation Tree Edit Distance) Similarity metric, which enhances traditional Tree Edit Distance by incorporating semantic awareness through transformations. For rigorous validation, we present EPLA (Evaluating Provability and Likeness for Autoformalization), a new benchmark of 524 expert-annotated formal statement pairs derived from miniF2F and ProofNet, with labels for both semantic provability and structural likeness. Experiments on EPLA demonstrate that TransTED Similarity outperforms existing methods, achieving state-of-the-art accuracy and the highest Kappa coefficient. The benchmark and implementation code will be made public soon.
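To make the idea of a tree-edit-distance similarity over operator trees concrete, here is a minimal sketch. It is not the paper's TransTED implementation, which has not been released at the time of this listing. Everything in it is an assumption made for illustration: the nested-tuple tree encoding, unit edit costs, the normalization by the combined node count, and the hypothetical `canonicalize` step that sorts the operands of commutative operators as a stand-in for a semantics-aware transformation.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, children), children is a tuple of trees.
# A forest is a tuple of trees. Nested tuples are hashable, so they can be cached.

def size(forest):
    """Count all nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """Ordered tree edit distance between two forests with unit costs."""
    if not f1 and not f2:
        return 0
    if not f1:
        return size(f2)  # insert every node of f2
    if not f2:
        return size(f1)  # delete every node of f1
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1; its children take its place.
    delete = forest_dist(f1[:-1] + c1, f2) + 1
    # Insert the rightmost root of f2.
    insert = forest_dist(f1, f2[:-1] + c2) + 1
    # Match the two rightmost roots (relabel if the labels differ).
    match = (forest_dist(c1, c2) + forest_dist(f1[:-1], f2[:-1])
             + (0 if l1 == l2 else 1))
    return min(delete, insert, match)

def similarity(t1, t2):
    """Map the distance to [0, 1]; deleting t1 and inserting t2 is the worst case."""
    total = size((t1,)) + size((t2,))
    return 1 - forest_dist((t1,), (t2,)) / total

# Hypothetical set of commutative operators, chosen only for this example.
COMMUTATIVE = {"+", "*", "="}

def canonicalize(tree):
    """Illustrative transformation: sort operands of commutative operators."""
    label, children = tree
    kids = tuple(canonicalize(c) for c in children)
    if label in COMMUTATIVE:
        kids = tuple(sorted(kids))
    return (label, kids)

def leaf(name):
    return (name, ())

# Operator trees for "a + b = c" and "b + a = c".
t1 = ("=", (("+", (leaf("a"), leaf("b"))), leaf("c")))
t2 = ("=", (("+", (leaf("b"), leaf("a"))), leaf("c")))

plain = similarity(t1, t2)                                # two relabels apart
aware = similarity(canonicalize(t1), canonicalize(t2))    # identical after sorting
```

In this toy case the plain distance penalizes the swapped operands, while the canonicalized trees are identical. That is the kind of gap between structural and semantic similarity that the abstract describes, although the paper's actual transformations may differ substantially from this sorting step.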
Related papers
- AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering [97.52852990265136]
We introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks.
arXiv Detail & Related papers (2026-01-21T07:35:36Z) - Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation [11.450834626205676]
Table-BiEval is a novel approach based on a human-free, self-supervised evaluation framework. It calculates Content Semantic Accuracy and Normalized Tree Edit Distance to decouple structure from content. Results reveal substantial variability, highlighting that mid-sized models can surprisingly outperform larger counterparts in structural efficiency.
arXiv Detail & Related papers (2026-01-09T07:38:27Z) - Autoformalizer with Tool Feedback [52.334957386319864]
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Existing formalizers still struggle to consistently generate valid statements that meet syntactic validity and semantic consistency. We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
arXiv Detail & Related papers (2025-10-08T10:25:12Z) - Semantic F1 Scores: Fair Evaluation Under Fuzzy Class Boundaries [65.89202599399252]
We propose Semantic F1 Scores, novel evaluation metrics for subjective or fuzzy multi-label classification. By granting partial credit for semantically related but nonidentical labels, Semantic F1 better reflects the realities of domains marked by human disagreement or fuzzy category boundaries.
arXiv Detail & Related papers (2025-09-25T21:48:48Z) - SCORE: A Semantic Evaluation Framework for Generative Document Parsing [2.5101597298392098]
Multi-modal generative document parsing systems produce semantically correct yet structurally divergent outputs. Conventional metrics (CER, WER, IoU, or TEDS) misclassify such diversity as error, penalizing valid interpretations and obscuring system behavior. We introduce SCORE, an interpretation-agnostic framework that integrates (i) adjusted edit distance for robust content fidelity, (ii) token-level diagnostics to distinguish hallucinations from omissions, (iii) table evaluation with spatial tolerance and semantic alignment, and (iv) hierarchy-aware consistency checks.
arXiv Detail & Related papers (2025-09-16T16:06:19Z) - StructCoh: Structured Contrastive Learning for Context-Aware Text Semantic Matching [10.000850856259866]
StructCoh is a graph-enhanced contrastive learning framework. A hierarchical contrastive objective enforces consistency at multiple granularities. Experiments on three legal document matching benchmarks and academic plagiarism detection datasets demonstrate significant improvements.
arXiv Detail & Related papers (2025-09-02T07:21:36Z) - Generalized Tree Edit Distance (GTED): A Faithful Evaluation Metric for Statement Autoformalization [11.26658223467498]
GTED is an evaluation framework that standardizes formal statements and converts them into operator trees. It determines the semantic similarity using the eponymous GTED metric. GTED consistently ranks as a top-performing metric, achieving the highest accuracy and Kappa on miniF2F and the joint-highest accuracy on ProofNet.
arXiv Detail & Related papers (2025-07-10T03:34:58Z) - Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness [13.258954013620885]
CTSES is a composite metric that integrates CodeBLEU, METEOR, and ROUGE-L to balance behavior, lexical quality, and structural alignment. Our results show that CTSES yields more faithful and interpretable assessments, better aligned with developer expectations and human intuition than existing metrics.
arXiv Detail & Related papers (2025-06-07T11:18:17Z) - QUDsim: Quantifying Discourse Similarities in LLM-Generated Text [70.22275200293964]
We introduce an abstraction based on linguistic theories in Questions Under Discussion (QUD) and question semantics to help quantify differences in discourse progression. We then use this framework to build QUDsim, a similarity metric that can detect discursive parallels between documents. Using QUDsim, we find that LLMs often reuse discourse structures (more so than humans) across samples, even when content differs.
arXiv Detail & Related papers (2025-04-12T23:46:09Z) - StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z) - Autoformalize Mathematical Statements by Symbolic Equivalence and Semantic Consistency [22.86318578119266]
We introduce a novel framework that scores and selects the best result from k autoformalization candidates based on symbolic equivalence and semantic consistency. Our experiments on the MATH and miniF2F datasets demonstrate that our approach significantly enhances autoformalization accuracy.
arXiv Detail & Related papers (2024-10-28T11:37:39Z) - Identifiable Exchangeable Mechanisms for Causal Structure and Representation Learning [54.69189620971405]
We provide a unified framework, termed Identifiable Exchangeable Mechanisms (IEM), for representation and structure learning. IEM provides new insights that let us relax the necessary conditions for causal structure identification in exchangeable non-i.i.d. data. We also demonstrate the existence of a duality condition in identifiable representation learning, leading to new identifiability results.
arXiv Detail & Related papers (2024-06-20T13:30:25Z) - Duality-Induced Regularizer for Semantic Matching Knowledge Graph Embeddings [70.390286614242]
We propose a novel regularizer -- namely, DUality-induced RegulArizer (DURA) -- which effectively encourages the entities with similar semantics to have similar embeddings.
Experiments demonstrate that DURA consistently and significantly improves the performance of state-of-the-art semantic matching models.
arXiv Detail & Related papers (2022-03-24T09:24:39Z) - Unsupervised Distillation of Syntactic Information from Contextualized Word Representations [62.230491683411536]
We tackle the task of unsupervised disentanglement between semantics and structure in neural language representations.
To this end, we automatically generate groups of sentences which are structurally similar but semantically different.
We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics.
arXiv Detail & Related papers (2020-10-11T15:13:18Z)