OAEI-LLM: A Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching
- URL: http://arxiv.org/abs/2409.14038v4
- Date: Mon, 11 Nov 2024 06:26:39 GMT
- Title: OAEI-LLM: A Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching
- Authors: Zhangcheng Qiang, Kerry Taylor, Weiqing Wang, Jing Jiang
- Abstract summary: Hallucinations of large language models (LLMs) commonly occur in domain-specific downstream tasks, and ontology matching (OM) is no exception.
The OAEI-LLM dataset is an extended version of the Ontology Alignment Evaluation Initiative (OAEI) datasets that evaluates LLM-specific hallucinations in OM tasks.
- Score: 8.732396482276332
- Abstract: Hallucinations of large language models (LLMs) commonly occur in domain-specific downstream tasks, and ontology matching (OM) is no exception. The prevalence of using LLMs for OM raises the need for benchmarks to better understand LLM hallucinations. The OAEI-LLM dataset is an extended version of the Ontology Alignment Evaluation Initiative (OAEI) datasets that evaluates LLM-specific hallucinations in OM tasks. We outline the methodology used in dataset construction and schema extension, and provide examples of potential use cases.
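The abstract mentions a schema extension but does not reproduce it. As a purely illustrative sketch (the class names, fields, and hallucination categories below are assumptions, not the dataset's actual schema), an OAEI-style correspondence annotated with LLM behaviour might be represented like this:

```python
from dataclasses import dataclass

@dataclass
class Correspondence:
    """A single entity mapping in the style of an OAEI reference alignment."""
    source_entity: str      # IRI in the source ontology
    target_entity: str      # IRI in the target ontology
    relation: str = "="     # equivalence by default
    confidence: float = 1.0

@dataclass
class LLMAnnotatedCorrespondence(Correspondence):
    """Hypothetical extension recording how an LLM handled the same mapping."""
    llm_name: str = ""
    llm_prediction: str = ""      # the target the LLM actually proposed
    hallucinated: bool = False    # True if the LLM output contradicts the reference
    hallucination_type: str = ""  # illustrative label, e.g. "wrong-target"

# Example: the reference alignment matches the two classes,
# but the LLM proposed a different (hallucinated) target.
record = LLMAnnotatedCorrespondence(
    source_entity="http://example.org/onto1#ConferencePaper",
    target_entity="http://example.org/onto2#Paper",
    llm_name="some-llm",
    llm_prediction="http://example.org/onto2#Poster",
    hallucinated=True,
    hallucination_type="wrong-target",
)
print(record)
```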
Related papers
- ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models [15.156359255401812]
This paper introduces ODE, an open-set, dynamic protocol for evaluating object existence hallucinations in multimodal large language models (MLLMs).
Our framework employs graph structures to model associations between real-world concepts and generates novel samples for both general and domain-specific scenarios.
Experimental results show that MLLMs exhibit higher hallucination rates on ODE-generated samples, which also avoid data contamination.
arXiv Detail & Related papers (2024-09-14T05:31:29Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
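As a rough illustration of that self-synthetic loop (the `student.generate` and `student.finetune` calls below are hypothetical placeholders, not the paper's actual pipeline):

```python
def self_guide_round(student, task_instruction, seed_examples, n_synthetic=100):
    """One hypothetical round of self-synthetic fine-tuning.

    `student` is assumed to expose `generate(prompt) -> str` and
    `finetune(pairs)`; both stand in for whatever LLM API is actually used.
    """
    synthetic_pairs = []
    for _ in range(n_synthetic):
        # The student model invents a new task input from the instruction and seeds.
        new_input = student.generate(
            f"{task_instruction}\nExamples: {seed_examples}\nWrite one new input:"
        )
        # The same model then answers its own input.
        new_output = student.generate(
            f"{task_instruction}\nInput: {new_input}\nOutput:"
        )
        synthetic_pairs.append((new_input, new_output))

    # Keep only non-empty outputs (a stand-in for a real quality filter),
    # then fine-tune the student on its own synthetic data.
    filtered = [(x, y) for x, y in synthetic_pairs if y.strip()]
    student.finetune(filtered)
    return student
```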
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
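A toy sketch of that perturbation step, assuming a deliberately simplified graph representation (this is not the authors' implementation):

```python
import random

def perturb_reasoning_graph(edges, new_fact):
    """Insert one extra reasoning step into a reasoning graph.

    The graph is simplified to a list of (premise, conclusion) edges; routing an
    existing edge through an intermediate fact increases the depth of the
    reasoning chain while preserving the original nodes, mimicking the
    controlled-complexity extension described above in a very rough way.
    """
    premise, conclusion = random.choice(edges)
    perturbed = [e for e in edges if e != (premise, conclusion)]
    perturbed.append((premise, new_fact))     # premise now supports the new fact...
    perturbed.append((new_fact, conclusion))  # ...which in turn supports the old conclusion
    return perturbed

# Example: a two-step arithmetic chain gains one intermediate step.
graph = [("x = 2", "x + 3 = 5"), ("x + 3 = 5", "2 * (x + 3) = 10")]
print(perturb_reasoning_graph(graph, "x + 1 = 3"))
```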
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild [41.86776426516293]
Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains.
We introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild.
arXiv Detail & Related papers (2024-03-07T08:25:46Z)
- HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs [0.0]
Hallucinations pose a significant challenge to the reliability and alignment of Large Language Models (LLMs).
This paper introduces an automated scalable framework that combines benchmarking LLMs' hallucination tendencies with efficient hallucination detection.
The framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain.
arXiv Detail & Related papers (2024-02-25T22:23:37Z)
- Large Language Models for Data Annotation: A Survey [49.8318827245266]
The emergence of advanced Large Language Models (LLMs) presents an unprecedented opportunity to automate the complicated process of data annotation.
This survey includes an in-depth taxonomy of data types that LLMs can annotate, a review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation.
arXiv Detail & Related papers (2024-02-21T00:44:04Z)
- RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models [9.465753274663061]
Retrieval-augmented generation (RAG) has become a main technique for alleviating hallucinations in large language models (LLMs).
This paper presents RAGTruth, a corpus tailored for analyzing word-level hallucinations in various domains.
arXiv Detail & Related papers (2023-12-31T04:43:45Z)
- AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation [58.19101663976327]
Multi-modal Large Language Models (MLLMs) encounter the significant challenge of hallucinations.
Evaluating MLLMs' hallucinations is becoming increasingly important for model improvement and practical application deployment.
We propose AMBER, an LLM-free multi-dimensional benchmark that can be used to evaluate both generative and discriminative tasks.
arXiv Detail & Related papers (2023-11-13T15:25:42Z)
- Local Large Language Models for Complex Structured Medical Tasks [0.0]
This paper introduces an approach that combines the language reasoning capabilities of large language models with the benefits of local training to tackle complex, domain-specific tasks.
Specifically, the authors demonstrate their approach by extracting structured condition codes from pathology reports.
arXiv Detail & Related papers (2023-08-03T12:36:13Z)
- An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
However, these models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z)