Systematic Evaluation of Knowledge Graph Repair with Large Language Models
- URL: http://arxiv.org/abs/2507.22419v1
- Date: Wed, 30 Jul 2025 06:46:30 GMT
- Title: Systematic Evaluation of Knowledge Graph Repair with Large Language Models
- Authors: Tung-Wei Lin, Gabe Fierro, Han Li, Tianzhen Hong, Pierluigi Nuzzo, Alberto Sangiovanni-Vincentelli
- Abstract summary: We present a systematic approach for evaluating the quality of knowledge graph repairs with respect to constraint violations defined in the Shapes Constraint Language (SHACL). Our method addresses this gap by systematically generating violations using a novel mechanism, termed violation-inducing operations (VIOs). Results indicate that concise prompts containing both the relevant violated SHACL constraints and key contextual information from the knowledge graph yield the best performance.
- Score: 12.105264212919018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a systematic approach for evaluating the quality of knowledge graph repairs with respect to constraint violations defined in the Shapes Constraint Language (SHACL). Current evaluation methods rely on ad hoc datasets, which limits the rigorous analysis of repair systems in more general settings. Our method addresses this gap by systematically generating violations using a novel mechanism, termed violation-inducing operations (VIOs). We use the proposed evaluation framework to assess a range of repair systems which we build using large language models. We analyze the performance of these systems across different prompting strategies. Results indicate that concise prompts containing both the relevant violated SHACL constraints and key contextual information from the knowledge graph yield the best performance.
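The idea of a violation-inducing operation (VIO) can be sketched in miniature. The sketch below is a hypothetical illustration, not the paper's actual mechanism or API: it assumes a toy knowledge graph represented as a set of triples, a SHACL-style minimum-cardinality check standing in for a real sh:minCount shape, and a delete-type VIO that removes a required triple so that a previously conforming graph violates the constraint. All names (`min_count_violations`, `apply_vio_delete`, the `ex:` entities) are invented for illustration.

```python
# A tiny knowledge graph as a set of (subject, predicate, object) triples.
kg = {
    ("ex:Room1", "rdf:type", "ex:Room"),
    ("ex:Room1", "ex:hasSensor", "ex:Sensor1"),
    ("ex:Room2", "rdf:type", "ex:Room"),
    ("ex:Room2", "ex:hasSensor", "ex:Sensor2"),
}

def min_count_violations(graph, target_class, predicate, min_count=1):
    """Report instances of `target_class` with fewer than `min_count`
    values for `predicate` (analogous to a SHACL sh:minCount shape
    whose target class is `target_class`)."""
    targets = {s for (s, p, o) in graph
               if p == "rdf:type" and o == target_class}
    return sorted(
        s for s in targets
        if sum(1 for (s2, p, _) in graph
               if s2 == s and p == predicate) < min_count
    )

def apply_vio_delete(graph, triple):
    """A delete-type VIO: remove one triple so that a constraint that
    previously held no longer does."""
    return graph - {triple}

# The original graph conforms; after the VIO, ex:Room2 violates the shape.
assert min_count_violations(kg, "ex:Room", "ex:hasSensor") == []
broken = apply_vio_delete(kg, ("ex:Room2", "ex:hasSensor", "ex:Sensor2"))
print(min_count_violations(broken, "ex:Room", "ex:hasSensor"))  # ['ex:Room2']
```

A repair system would then be handed `broken` together with the violated shape, and its proposed repair could be scored against the known deleted triple; this known ground truth is what makes systematically generated violations useful for evaluation.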
Related papers
- Auditing Language Model Unlearning via Information Decomposition [68.48660428111593]
We introduce an interpretable, information-theoretic framework for auditing unlearning using Partial Information Decomposition (PID). By comparing model representations before and after unlearning, we decompose the mutual information with the forgotten data into distinct components, formalizing the notions of unlearned and residual knowledge. Our work introduces a principled, representation-level audit for unlearning, offering theoretical insight and actionable tools for safer deployment of language models.
arXiv Detail & Related papers (2026-01-21T15:51:19Z)
- Visibility Allocation Systems: How Algorithmic Design Shapes Online Visibility and Societal Outcomes [0.5863360388454261]
We introduce a formal framework for visibility allocation systems (VASs). VASs decide which (processed) data to present to a human user. We show how our framework can support ongoing AI-legislative efforts to locate obligations, quantify systemic risks, and enable adaptive compliance.
arXiv Detail & Related papers (2025-10-20T07:28:24Z)
- Formal Analysis of Metastable Failures in Software Systems [5.436969030534807]
We provide the mathematical foundations of metastability in request-response server systems. We show how to construct continuous-time Markov chains (CTMCs) that approximate the semantics of the programs. We show that our qualitative visual analysis captures and predicts, in a matter of milliseconds, many instances of metastability that were observed in the field.
arXiv Detail & Related papers (2025-10-03T22:44:07Z)
- Understanding GUI Agent Localization Biases through Logit Sharpness [15.986679553468989]
Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems by grounding language into spatial actions. Despite their promising performance, these models frequently exhibit hallucinations: systematic localization errors that compromise reliability. We propose a fine-grained evaluation framework that categorizes model predictions into four distinct types, revealing nuanced failure modes beyond traditional accuracy metrics.
arXiv Detail & Related papers (2025-06-18T12:55:35Z)
- ReLearn: Unlearning via Learning for Large Language Models [64.2802606302194]
We propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output.
arXiv Detail & Related papers (2025-02-16T16:31:00Z)
- A Closer Look at System Prompt Robustness [2.5525497052179995]
Developers depend on system prompts to specify important context, output format, personalities, guardrails, content policies, and safety countermeasures. In practice, models often forget to consider relevant guardrails or fail to resolve conflicting demands between the system and the user. We create realistic new evaluation and fine-tuning datasets based on prompts collected from OpenAI's GPT Store and HuggingFace's HuggingChat.
arXiv Detail & Related papers (2025-02-15T18:10:45Z)
- DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models [39.493913608472404]
Large language model (LLM)-based Grammatical Error Correction (GEC) models often produce corrections that diverge from provided gold references. This discrepancy undermines the reliability of traditional reference-based evaluation metrics. We propose a novel evaluation framework for GEC models, DSGram, integrating Semantic Coherence, Edit Level, and Fluency, and utilizing a dynamic weighting mechanism.
arXiv Detail & Related papers (2024-12-17T11:54:16Z)
- Translation Canvas: An Explainable Interface to Pinpoint and Analyze Translation Systems [16.102196839755823]
We introduce Translation Canvas, an explainable interface designed to pinpoint and analyze translation systems' performance.
It supports fine-grained analysis by highlighting error spans with explanations and selectively displaying systems' predictions.
According to human evaluation, Translation Canvas demonstrates superior performance over COMET and SacreBLEU packages.
arXiv Detail & Related papers (2024-10-07T16:54:18Z)
- GRAMMAR: Grounded and Modular Methodology for Assessment of Closed-Domain Retrieval-Augmented Language Model [6.106667677504318]
Retrieval-Augmented Generation (RAG) systems are widely used across various industries for querying closed-domain and in-house knowledge bases.
However, evaluating these systems presents significant challenges due to the private nature of closed-domain data and the scarcity of queries with verifiable ground truths.
We introduce GRAMMAR, an evaluation framework comprising a grounded data generation process and an evaluation protocol that effectively pinpoints defective modules.
arXiv Detail & Related papers (2024-04-30T03:29:30Z)
- Overcoming Pitfalls in Graph Contrastive Learning Evaluation: Toward Comprehensive Benchmarks [60.82579717007963]
We introduce an enhanced evaluation framework designed to more accurately gauge the effectiveness, consistency, and overall capability of Graph Contrastive Learning (GCL) methods.
arXiv Detail & Related papers (2024-02-24T01:47:56Z)
- Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z)
- Better Understanding Differences in Attribution Methods via Systematic Evaluations [57.35035463793008]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We use these evaluation schemes to study strengths and shortcomings of some widely used attribution methods over a wide range of models.
arXiv Detail & Related papers (2023-03-21T14:24:58Z)
- GLUECons: A Generic Benchmark for Learning Under Constraints [102.78051169725455]
In this work, we create a benchmark that is a collection of nine tasks in the domains of natural language processing and computer vision.
We model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints.
arXiv Detail & Related papers (2023-02-16T16:45:36Z)
- Schema-aware Reference as Prompt Improves Data-Efficient Knowledge Graph Construction [57.854498238624366]
We propose a retrieval-augmented approach, which retrieves schema-aware Reference As Prompt (RAP) for data-efficient knowledge graph construction.
RAP can dynamically leverage schema and knowledge inherited from human-annotated and weak-supervised data as a prompt for each sample.
arXiv Detail & Related papers (2022-10-19T16:40:28Z)
- Indicators of Attack Failure: Debugging and Improving Optimization of Adversarial Examples [29.385242714424624]
Evaluating the robustness of machine-learning models to adversarial examples is a challenging problem.
We define a set of quantitative indicators which unveil common failures in the optimization of gradient-based attacks.
Our experimental analysis shows that the proposed indicators of failure can be used to visualize, debug and improve current adversarial robustness evaluations.
arXiv Detail & Related papers (2021-06-18T06:57:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.