Related papers: Have We Designed Generalizable Structural Knowledge Promptings? Systematic Evaluation and Rethinking

URL: http://arxiv.org/abs/2501.00244v1
Date: Tue, 31 Dec 2024 03:20:22 GMT
Title: Have We Designed Generalizable Structural Knowledge Promptings? Systematic Evaluation and Rethinking
Authors: Yichi Zhang, Zhuo Chen, Lingbing Guo, Yajing Xu, Shaokai Chen, Mengshu Sun, Binbin Hu, Zhiqiang Zhang, Lei Liang, Wen Zhang, Huajun Chen
Abstract summary: This paper aims to evaluate and rethink the generalization capability of the SKP paradigm from four perspectives including Granularity, Transferability, Scalability, and Universality. We introduce a novel multi-granular, multi-level benchmark called SUBARU, consisting of 9 different tasks with varying levels of granularity and difficulty.
Score: 44.66045367454493
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models (LLMs) have demonstrated exceptional performance in text generation within current NLP research. However, the lack of factual accuracy is still a dark cloud hanging over the LLM skyscraper. Structural knowledge prompting (SKP) is a prominent paradigm to integrate external knowledge into LLMs by incorporating structural representations, achieving state-of-the-art results in many knowledge-intensive tasks. However, existing methods often focus on specific problems, lacking a comprehensive exploration of the generalization and capability boundaries of SKP. This paper aims to evaluate and rethink the generalization capability of the SKP paradigm from four perspectives including Granularity, Transferability, Scalability, and Universality. To provide a thorough evaluation, we introduce a novel multi-granular, multi-level benchmark called SUBARU, consisting of 9 different tasks with varying levels of granularity and difficulty.

Related papers

SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs [29.88977150203991]
We introduce SKA-Bench, a Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: KG, Table, KG+Text, and Table+Text. We utilize a three-stage pipeline to construct SKA-Bench instances, which include a question, an answer, positive knowledge units, and noisy knowledge units. To evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we expand the instances into four fundamental ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection.
arXiv Detail & Related papers (2025-07-23T03:52:24Z)
OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases [38.58409057214189]
OneEval is a benchmark to assess the knowledge-intensive reasoning capabilities of Large Language Models (LLMs). OneEval comprises 4,019 carefully curated instances and includes a challenging subset, OneEval_Hard, consisting of 1,285 particularly difficult cases. We release the OneEval datasets, evaluation scripts, and baseline results publicly, accompanied by a leaderboard to facilitate ongoing advancements in structured knowledge reasoning.
arXiv Detail & Related papers (2025-06-14T17:16:05Z)
On Path to Multimodal Generalist: General-Level and General-Bench [153.9720740167528]
This project introduces General-Level, an evaluation framework that defines 5-scale levels of MLLM performance and generality. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation. The evaluation results, which involve over 100 existing state-of-the-art MLLMs, uncover the capability rankings of generalists.
arXiv Detail & Related papers (2025-05-07T17:59:32Z)
Decoding Knowledge in Large Language Models: A Framework for Categorization and Comprehension [14.039653386385519]
Large language models (LLMs) acquire, retain, and apply knowledge. This paper introduces a novel framework, K-(CSA)$^2$, which categorizes LLM knowledge along two dimensions: correctness and confidence.
arXiv Detail & Related papers (2025-01-02T16:34:10Z)
StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization [94.31508613367296]
Retrieval-augmented generation (RAG) is a key means to effectively enhance large language models (LLMs). We propose StructRAG, which can identify the optimal structure type for the task at hand, reconstruct original documents into this structured format, and infer answers based on the resulting structure. Experiments show that StructRAG achieves state-of-the-art performance, particularly excelling in challenging scenarios.
arXiv Detail & Related papers (2024-10-11T13:52:44Z)
Can large language models understand uncommon meanings of common words? [30.527834781076546]
Large language models (LLMs) have shown significant advancements across diverse natural language understanding (NLU) tasks. Yet, for lack of widely acknowledged testing mechanisms, it remains unclear whether LLMs are parrots or genuinely comprehend the world. This paper presents an innovative construction of a Lexical Semantic dataset with novel evaluation metrics.
arXiv Detail & Related papers (2024-05-09T12:58:22Z)
FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks. We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
MIKE: A New Benchmark for Fine-grained Multimodal Entity Knowledge Editing [21.760293271882997]
Multimodal knowledge editing represents a critical advancement in enhancing the capabilities of Multimodal Large Language Models (MLLMs). Current benchmarks predominantly focus on coarse-grained knowledge, leaving the intricacies of fine-grained (FG) multimodal entity knowledge largely unexplored. To bridge this gap, we introduce MIKE, a comprehensive benchmark and dataset specifically designed for FG multimodal entity knowledge editing.
arXiv Detail & Related papers (2024-02-18T07:15:03Z)
A Comprehensive Study of Knowledge Editing for Large Language Models [82.65729336401027]
Large Language Models (LLMs) have shown extraordinary capabilities in understanding and generating text that closely mirrors human communication. This paper defines the knowledge editing problem and provides a comprehensive review of cutting-edge approaches. We introduce a new benchmark, KnowEdit, for a comprehensive empirical evaluation of representative knowledge editing approaches.
arXiv Detail & Related papers (2024-01-02T16:54:58Z)
Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains. The enigmatic "black-box" nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications. We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z)
Knowledge Crosswords: Geometric Knowledge Reasoning with Large Language Models [49.23348672822087]
We propose Knowledge Crosswords, a benchmark consisting of incomplete knowledge networks bounded by structured factual constraints. The novel setting of geometric knowledge reasoning necessitates new LM abilities beyond existing atomic/linear multi-hop QA. We conduct extensive experiments to evaluate existing LLMs and approaches on Knowledge Crosswords.
arXiv Detail & Related papers (2023-10-02T15:43:53Z)
Metacognitive Prompting Improves Understanding in Large Language Models [12.112914393948415]
We introduce Metacognitive Prompting (MP), a strategy inspired by human introspective reasoning processes. We conduct experiments on four prevalent Large Language Models (LLMs) across ten natural language understanding (NLU) datasets. MP consistently outperforms existing prompting methods in both general and domain-specific NLU tasks.
arXiv Detail & Related papers (2023-08-10T05:10:17Z)
Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models [68.18370230899102]
We investigate how to elicit compositional generalization capabilities in large language models (LLMs). We find that demonstrating both foundational skills and compositional examples grounded in these skills within the same prompt context is crucial. We show that fine-tuning LLMs with SKiC-style data can elicit zero-shot weak-to-strong generalization.
arXiv Detail & Related papers (2023-08-01T05:54:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.