UCSC at SemEval-2025 Task 3: Context, Models and Prompt Optimization for Automated Hallucination Detection in LLM Output
- URL: http://arxiv.org/abs/2505.03030v1
- Date: Mon, 05 May 2025 21:15:40 GMT
- Title: UCSC at SemEval-2025 Task 3: Context, Models and Prompt Optimization for Automated Hallucination Detection in LLM Output
- Authors: Sicong Huang, Jincheng He, Shiyuan Huang, Karthik Raja Anandan, Arkajyoti Chakraborty, Ian Lane
- Abstract summary: SemEval 2025 Task 3, Mu-SHROOM: Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, is a recent effort in this direction. This paper describes the UCSC system submission to the Mu-SHROOM shared task. We introduce a framework that first retrieves relevant context, next identifies false content in the answer, and finally maps it back to spans in the LLM output.
- Score: 7.121378498209948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hallucinations pose a significant challenge for large language models when answering knowledge-intensive queries. As LLMs become more widely adopted, it is crucial not only to detect if hallucinations occur but also to pinpoint exactly where in the LLM output they occur. SemEval 2025 Task 3, Mu-SHROOM: Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, is a recent effort in this direction. This paper describes the UCSC system submission to the Mu-SHROOM shared task. We introduce a framework that first retrieves relevant context, next identifies false content in the answer, and finally maps it back to spans in the LLM output. The process is further enhanced by automatically optimizing prompts. Our system achieves the highest overall performance, ranking #1 in average position across all languages. We release our code and experiment results.
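To make the pipeline described in the abstract concrete, the sketch below shows one way a retrieve → verify → map workflow could be wired together. It is not the released UCSC implementation: the function names, the prompt wording, and the `search` and `llm` callables (a passage retriever and an instruction-following model client) are assumptions introduced for illustration, and the prompt-optimization step is omitted.

```python
# Illustrative sketch of a retrieve -> verify -> map hallucination-detection
# pipeline in the spirit of the system described above. The `search` and `llm`
# callables, function names, and prompt wording are assumptions, not the
# authors' released code.

def retrieve_context(question: str, search, k: int = 5) -> list[str]:
    """Fetch passages likely to contain the facts needed to check the answer."""
    return [hit["text"] for hit in search(question)[:k]]


def identify_false_content(question: str, answer: str,
                           passages: list[str], llm) -> list[str]:
    """Ask an LLM to list claims in the answer that the retrieved context does not support."""
    prompt = (
        "Context:\n" + "\n".join(passages)
        + f"\n\nQuestion: {question}\nAnswer: {answer}\n"
        + "List every statement in the answer that the context does not support, one per line."
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def map_to_spans(answer: str, false_claims: list[str]) -> list[tuple[int, int]]:
    """Map each unsupported claim back to character offsets in the original answer.

    Exact substring matching is a simplification; paraphrased claims would need
    fuzzy alignment in practice.
    """
    spans = []
    for claim in false_claims:
        start = answer.find(claim)
        if start != -1:
            spans.append((start, start + len(claim)))
    return spans


def detect_hallucination_spans(question: str, answer: str, search, llm) -> list[tuple[int, int]]:
    """End-to-end: retrieve context, flag unsupported content, map it to spans."""
    passages = retrieve_context(question, search)
    false_claims = identify_false_content(question, answer, passages, llm)
    return map_to_spans(answer, false_claims)
```

A real system would also iterate on the verification prompt automatically (the prompt-optimization step the abstract mentions) and handle claims that only paraphrase the answer text rather than quoting it verbatim.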
Related papers
- TUM-MiKaNi at SemEval-2025 Task 3: Towards Multilingual and Knowledge-Aware Non-factual Hallucination Identification [2.3999111269325266]
This paper describes our submission to SemEval-2025 Task 3, Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. We propose a two-part pipeline that combines retrieval-based fact verification against Wikipedia with a BERT-based system fine-tuned to identify common hallucination patterns (an illustrative sketch of such a span tagger appears after this list).
arXiv Detail & Related papers (2025-07-01T09:00:50Z)
- MSA at SemEval-2025 Task 3: High Quality Weak Labeling and LLM Ensemble Verification for Multilingual Hallucination Detection [0.0]
This paper describes our submission for SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. The task involves detecting hallucinated spans in text generated by instruction-tuned Large Language Models (LLMs) across multiple languages. Our system ranked 1st in Arabic and Basque, 2nd in German, Swedish, and Finnish, and 3rd in Czech, Farsi, and French.
arXiv Detail & Related papers (2025-05-27T08:26:17Z)
- Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation? [7.416552590139255]
We evaluate a suite of open-access LLMs on their ability to detect intrinsic hallucinations in two conditional generation tasks. We study how model performance varies across tasks and languages. We find that performance varies across models but is consistent across prompts.
arXiv Detail & Related papers (2025-04-29T12:30:05Z)
- LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination texts.
LongHalQA features GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z)
- Can LLM be a Good Path Planner based on Prompt Engineering? Mitigating the Hallucination for Path Planning [2.313664320808389]
This study proposes an innovative model, Spatial-to-Relational Transformation and Curriculum Q-Learning (S2RCQL).
We design a path-planning algorithm based on Q-learning to mitigate the context-inconsistency hallucination.
Using the Q-values of state-action pairs as auxiliary information in prompts, we correct the hallucinations of LLMs.
arXiv Detail & Related papers (2024-08-23T16:02:54Z)
- Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models [70.19081534515371]
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks.
However, they can generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences.
We propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers.
arXiv Detail & Related papers (2024-07-04T18:47:42Z)
- Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems [76.69936664916061]
We study how the number of LM calls affects the performance of Vote and Filter-Vote.
We find, surprisingly, that across multiple language tasks, the performance of both Vote and Filter-Vote can first increase but then decrease as a function of the number of LM calls.
arXiv Detail & Related papers (2024-03-04T19:12:48Z)
- kNN-ICL: Compositional Task-Oriented Parsing Generalization with Nearest Neighbor In-Context Learning [50.40636157214161]
Task-Oriented Parsing (TOP) enables conversational assistants to interpret user commands expressed in natural language.
LLMs have achieved impressive performance in generating computer programs from natural language prompts.
This paper focuses on harnessing the capabilities of LLMs for semantic parsing tasks.
arXiv Detail & Related papers (2023-12-17T17:26:50Z)
- AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
AlignedCoT is an in-context learning technique for invoking Large Language Models.
It achieves consistent and correct step-wise prompts in zero-shot scenarios.
We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
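The TUM-MiKaNi entry above mentions pairing retrieval-based verification with a BERT-based system fine-tuned to flag hallucination patterns. The snippet below is a minimal, hypothetical illustration of what such a token-level span tagger could look like; the multilingual BERT backbone, the two-label scheme, and the premise that the classification head has already been fine-tuned on span-annotated data are illustrative assumptions, not details taken from that paper.

```python
# Hypothetical token-classification span tagger in the spirit of the
# BERT-based component described in the TUM-MiKaNi summary above.
# The backbone, the label scheme (0 = supported, 1 = hallucinated), and the
# premise that the head has been fine-tuned on span-labelled data are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

BACKBONE = "bert-base-multilingual-cased"  # assumed multilingual backbone
tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
model = AutoModelForTokenClassification.from_pretrained(BACKBONE, num_labels=2)
model.eval()  # the classification head still needs fine-tuning before real use


def tag_hallucinated_spans(question: str, answer: str) -> list[tuple[int, int]]:
    """Return character offsets in `answer` whose tokens are labelled as hallucinated."""
    enc = tokenizer(question, answer, return_tensors="pt",
                    truncation=True, return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]   # per-sequence character offsets
    segment_ids = enc["token_type_ids"][0]   # 1 marks tokens from `answer`
    with torch.no_grad():
        labels = model(**enc).logits.argmax(dim=-1)[0]
    spans = []
    for (start, end), label, seg in zip(offsets.tolist(), labels.tolist(),
                                        segment_ids.tolist()):
        if seg == 1 and label == 1 and end > start:  # skip question + special tokens
            spans.append((start, end))
    return spans
```

In practice, the predicted token spans would be merged into contiguous character ranges and combined with the retrieval-based verification signal before producing the final submission format.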