A Theorem-Proving-Based Evaluation of Neural Semantic Parsing
- URL: http://arxiv.org/abs/2510.11225v1
- Date: Mon, 13 Oct 2025 10:09:38 GMT
- Title: A Theorem-Proving-Based Evaluation of Neural Semantic Parsing
- Authors: Hayate Funakura, Hyunsoo Kim, Koji Mineshima
- Abstract summary: We reassess evaluation by pairing graph-matching with automated theorem proving. We evaluate outputs using graph-matching, bidirectional entailment between source and target formulas with a first-order logic theorem prover, and well-formedness.
- Score: 4.422349568747053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Graph-matching metrics such as Smatch are the de facto standard for evaluating neural semantic parsers, yet they capture surface overlap rather than logical equivalence. We reassess evaluation by pairing graph-matching with automated theorem proving. We compare two approaches to building parsers: supervised fine-tuning (T5-Small/Base) and few-shot in-context learning (GPT-4o/4.1/5), under normalized and unnormalized targets. We evaluate outputs using graph-matching, bidirectional entailment between source and target formulas with a first-order logic theorem prover, and well-formedness. Across settings, we find that models performing well on graph-matching often fail to produce logically equivalent formulas. Normalization reduces incidental target variability, improves well-formedness, and strengthens logical adequacy. Error analysis shows performance degrades with increasing formula complexity and with coordination, prepositional phrases, and passive voice; the dominant failures involve variable binding and indexing, and predicate naming. These findings highlight limits of graph-based metrics for reasoning-oriented applications and motivate logic-sensitive evaluation and training objectives together with simplified, normalized target representations. All code and data for our experiments are publicly available.
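To make the logical-adequacy check concrete, below is a minimal sketch of the bidirectional-entailment test described in the abstract, implemented with NLTK's interface to the Prover9 first-order theorem prover. This is an illustrative reconstruction, not the authors' released code: the formula syntax and helper name are assumptions, and the Prover9 binary must be installed separately.

```python
# Minimal sketch: a predicted formula counts as logically adequate only if
# it entails and is entailed by the gold formula (bidirectional entailment).
# Assumes NLTK plus a locally installed Prover9 binary.
from nltk.sem.logic import Expression
from nltk.inference.prover9 import Prover9

def logically_equivalent(gold_str: str, pred_str: str, timeout: int = 10) -> bool:
    """Return True iff the two first-order formulas entail each other."""
    gold = Expression.fromstring(gold_str)
    pred = Expression.fromstring(pred_str)
    prover = Prover9(timeout=timeout)
    forward = prover.prove(pred, [gold])   # gold |- pred
    backward = prover.prove(gold, [pred])  # pred |- gold
    return forward and backward

# Surface-distinct but logically equivalent formulas: a graph-matching
# metric may penalize the reordering, while the prover accepts it.
print(logically_equivalent(
    r'exists x.(dog(x) & walk(x))',
    r'exists y.(walk(y) & dog(y))',
))  # True
```

The example also shows why graph overlap and logical equivalence come apart: conjunct order and variable names change the surface form without changing the logic.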
Related papers
- Bridging Theory and Practice in Link Representation with Graph Neural Networks [15.088089745469652]
Graph Neural Networks (GNNs) are widely used to compute representations of node pairs for downstream tasks such as link prediction. We introduce a unifying framework, the $k_\phi$-$k_\rho$-$m$ framework, that subsumes existing message-passing link models. We use a graph symmetry metric that quantifies the difficulty of distinguishing links and show that while expressive models may underperform on standard benchmarks, they significantly outperform simpler ones as symmetry increases.
arXiv Detail & Related papers (2025-06-30T16:22:15Z) - Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens [14.78605805191225]
We investigate how the semantics of intermediate tokens, often anthropomorphized as "thoughts" or reasoning traces, actually influence model performance. We show that despite significant improvements over the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions.
arXiv Detail & Related papers (2025-05-19T23:29:23Z) - What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding [67.59552859593985]
Graph Transformers, which incorporate self-attention and positional encoding, have emerged as a powerful architecture for various graph learning tasks.
This paper introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised classification.
arXiv Detail & Related papers (2024-06-04T05:30:16Z) - MANO: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts [25.643876327918544]
Leveraging the models' outputs, specifically the logits, is a common approach to estimating the test accuracy of a pre-trained neural network on out-of-distribution samples.
Despite their ease of implementation and computational efficiency, current logit-based methods are vulnerable to overconfidence issues, leading to prediction bias.
We propose MaNo which applies a data-dependent normalization on the logits to reduce prediction bias and takes the $L_p$ norm of the matrix of normalized logits as the estimation score.
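As a rough illustration of the recipe summarized above (normalize the logit matrix, then score it with its $L_p$ norm), here is a hedged sketch that uses softmax as a stand-in normalizer; MaNo's actual data-dependent normalization differs, so treat this only as the general shape of the estimator.

```python
# Illustrative sketch: score a batch of logits by the L_p norm of the
# row-normalized logit matrix. Softmax stands in for MaNo's own
# data-dependent normalization.
import numpy as np

def lp_norm_score(logits: np.ndarray, p: int = 4) -> float:
    """logits: (n_samples, n_classes) raw outputs; higher score = more confident."""
    z = logits - logits.max(axis=1, keepdims=True)             # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)   # row-wise softmax
    n, k = probs.shape
    # L_p norm, scaled so scores are comparable across dataset sizes.
    return float((probs ** p).sum() ** (1.0 / p) / (n * k) ** (1.0 / p))

rng = np.random.default_rng(0)
sharp = lp_norm_score(5.0 * rng.normal(size=(100, 10)))  # peaked predictions
flat = lp_norm_score(0.1 * rng.normal(size=(100, 10)))   # near-uniform predictions
print(sharp > flat)  # True: confident models yield a larger norm
```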
arXiv Detail & Related papers (2024-05-29T10:45:06Z) - Abstract Meaning Representation-Based Logic-Driven Data Augmentation for Logical Reasoning [27.224364543134094]
We introduce a novel logic-driven data augmentation approach, AMR-LDA. AMR-LDA converts the original text into an Abstract Meaning Representation (AMR) graph and modifies it according to logical equivalence laws; the modified AMR graphs are subsequently converted back into text to create augmented data.
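Below is a minimal sketch of the graph-editing step in this pipeline, using the `penman` library to decode, modify, and re-encode an AMR graph. The hand-written graph and the single polarity flip are toy assumptions for illustration; the paper applies logical equivalence laws (such as contraposition and double negation) and uses learned text-to-AMR and AMR-to-text models.

```python
# Illustrative AMR edit with the penman library (pip install penman).
import penman

# Hand-written AMR for "It rains, causing the ground to be wet."
amr = """
(c / cause-01
   :ARG0 (r / rain-01)
   :ARG1 (w / wet-01
            :ARG1 (g / ground)))
"""
graph = penman.decode(amr)

# Toy logic-driven edit: negate the antecedent with a :polarity - edge.
# Stand-in for the paper's equivalence-law edits, not an exact rule.
augmented = penman.Graph(graph.triples + [('r', ':polarity', '-')])
print(penman.encode(augmented))  # serialized graph, ready for AMR-to-text
```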
arXiv Detail & Related papers (2023-05-21T23:16:26Z) - Neural-Symbolic Inference for Robust Autoregressive Graph Parsing via
Compositional Uncertainty Quantification [28.084115398817016]
We study a compositionality-aware approach to neural-symbolic inference informed by model confidence.
We empirically investigate the approach on the English Resource Grammar (ERG) parsing problem, over a diverse suite of standard in-domain and seven OOD corpora.
Our approach leads to 35.26% and 35.60% error reduction in aggregated Smatch score over neural and symbolic approaches respectively, and a 14% absolute accuracy gain in key tail linguistic categories over the neural model.
arXiv Detail & Related papers (2023-01-26T23:11:03Z) - GraphQ IR: Unifying Semantic Parsing of Graph Query Language with
Intermediate Representation [91.27083732371453]
We propose a unified intermediate representation (IR) for graph query languages, namely GraphQ IR.
With the IR's natural-language-like representation that bridges the semantic gap and its formally defined syntax that maintains the graph structure, neural semantic parsing can more effectively convert user queries into GraphQ IR.
Our approach can consistently achieve state-of-the-art performance on KQA Pro, Overnight and MetaQA.
arXiv Detail & Related papers (2022-05-24T13:59:53Z) - Deep Probabilistic Graph Matching [72.6690550634166]
We propose a deep learning-based graph matching framework that works for the original QAP without compromising on the matching constraints.
The proposed method is evaluated on three widely used benchmarks (Pascal VOC, Willow Object and SPair-71k) and outperforms all previous state-of-the-art methods on all of them.
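For reference, the "original QAP" mentioned above is standardly Lawler's quadratic assignment formulation of graph matching; this background formulation (not quoted from the abstract) is:

```latex
% Lawler-form QAP between graphs with n_1 and n_2 nodes, where K is the
% (n_1 n_2) x (n_1 n_2) pairwise affinity matrix and X the assignment.
\max_{\mathbf{X}} \; \operatorname{vec}(\mathbf{X})^{\top}\,\mathbf{K}\,\operatorname{vec}(\mathbf{X})
\quad \text{s.t.}\quad
\mathbf{X} \in \{0,1\}^{n_1 \times n_2},\;
\mathbf{X}\mathbf{1} = \mathbf{1},\;
\mathbf{X}^{\top}\mathbf{1} \le \mathbf{1}
```

The abstract's claim is that the proposed method optimizes this objective without relaxing the one-to-one matching constraints.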
arXiv Detail & Related papers (2022-01-05T13:37:27Z) - Enforcing Consistency in Weakly Supervised Semantic Parsing [68.2211621631765]
We explore the use of consistency between the output programs for related inputs to reduce the impact of spurious programs.
We find that a more consistent formalism leads to improved model performance even without consistency-based training.
arXiv Detail & Related papers (2021-07-13T03:48:04Z) - Evaluating Logical Generalization in Graph Neural Networks [59.70452462833374]
We study the task of logical generalization using graph neural networks (GNNs).
Our benchmark suite, GraphLog, requires that learning algorithms perform rule induction in different synthetic logics.
We find that the ability of models to generalize and adapt is strongly determined by the diversity of the logical rules they encounter during training.
arXiv Detail & Related papers (2020-03-14T05:45:55Z) - Pseudo-Convolutional Policy Gradient for Sequence-to-Sequence
Lip-Reading [96.48553941812366]
Lip-reading aims to infer the speech content from the lip movement sequence.
The traditional learning process of seq2seq models suffers from two problems: exposure bias introduced by teacher forcing, and the inconsistency between the training objective and the final evaluation metric.
We propose a novel pseudo-convolutional policy gradient (PCPG) based method to address these two problems.
arXiv Detail & Related papers (2020-03-09T09:12:26Z)