SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes
- URL: http://arxiv.org/abs/2504.11975v2
- Date: Mon, 28 Apr 2025 06:19:32 GMT
- Title: SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes
- Authors: Raúl Vázquez, Timothee Mickus, Elaine Zosa, Teemu Vahtola, Jörg Tiedemann, Aman Sinha, Vincent Segonne, Fernando Sánchez-Vega, Alessandro Raganato, Jindřich Libovický, Jussi Karlgren, Shaoxiong Ji, Jindřich Helcl, Liane Guillou, Ona de Gibert, Jaione Bengoetxea, Joseph Attieh, Marianna Apidianaki, et al.
- Abstract summary: We present the Mu-SHROOM shared task which is focused on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs). Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies.
- Score: 72.61348252096413
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present the Mu-SHROOM shared task which is focused on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs). Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies. The large number of submissions underscores the interest of the community in hallucination detection. We present the results of the participating systems and conduct an empirical analysis to identify key factors contributing to strong performance in this task. We also emphasize relevant current challenges, notably the varying degree of hallucinations across languages and the high annotator disagreement when labeling hallucination spans.
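The abstract frames hallucination detection as a span-labeling task over LLM outputs. As a rough illustration only, the sketch below represents hallucinated spans as character offsets and compares a predicted span set against a gold one with a character-level intersection-over-union; the example answer (with a fabricated date), the offsets, and the choice of IoU are assumptions made for this sketch, not the official Mu-SHROOM data format or metric.

```python
# Minimal sketch of a span-labeling view of hallucination detection.
# The example text, offsets, and IoU measure are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Span:
    start: int  # character offset where the flagged span begins
    end: int    # character offset one past the last flagged character

def char_set(spans):
    """Set of character positions covered by a list of spans."""
    covered = set()
    for s in spans:
        covered.update(range(s.start, s.end))
    return covered

def span_iou(predicted, gold):
    """Character-level intersection-over-union between two span sets."""
    p, g = char_set(predicted), char_set(gold)
    if not p and not g:
        return 1.0  # nothing hallucinated and nothing predicted
    return len(p & g) / len(p | g)

# Hypothetical LLM answer; the date "1987" is the fabricated part.
answer = "Mu-SHROOM was first organized in 1987 at SemEval."
gold_spans = [Span(start=33, end=37)]       # answer[33:37] == "1987"
predicted_spans = [Span(start=30, end=37)]  # a system over-predicts "in 1987"

print(f"IoU = {span_iou(predicted_spans, gold_spans):.2f}")
```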
Related papers
- HalluSearch at SemEval-2025 Task 3: A Search-Enhanced RAG Pipeline for Hallucination Detection [0.0]
HalluSearch is a pipeline designed to detect fabricated text spans in Large Language Model (LLM) outputs. It couples retrieval-augmented verification with fine-grained factual splitting to identify and localize hallucinated spans in 14 different languages. Empirical evaluations show that HalluSearch performs competitively, placing fourth in both English (within the top ten percent) and Czech.
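The summary above describes coupling retrieval-augmented verification with fine-grained factual splitting; the sketch below shows one plausible way such a pipeline could be wired together. The function names (split_into_claims, retrieve_evidence, is_supported) are hypothetical placeholders, not HalluSearch's actual components.

```python
# Illustrative outline of a retrieval-augmented verification pipeline.
# The three callables are hypothetical stand-ins for claim splitting,
# evidence retrieval, and claim verification.
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the answer

def detect_unsupported_spans(
    answer: str,
    split_into_claims: Callable[[str], List[Span]],
    retrieve_evidence: Callable[[str], List[str]],
    is_supported: Callable[[str, List[str]], bool],
) -> List[Span]:
    """Flag character spans whose claims are not backed by retrieved evidence."""
    unsupported: List[Span] = []
    for start, end in split_into_claims(answer):   # fine-grained factual splitting
        claim = answer[start:end]
        evidence = retrieve_evidence(claim)        # search / retrieval step
        if not is_supported(claim, evidence):      # verification step
            unsupported.append((start, end))       # likely hallucinated span
    return unsupported
```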
arXiv Detail & Related papers (2025-04-14T12:22:30Z)
- HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection [1.8230982862848586]
We aim to provide a nuanced, model-aware understanding of hallucination occurrences and severity in English.
We used natural language inference and fine-tuned a ModernBERT model using a synthetic dataset of 400 samples.
Results indicate a moderately positive correlation between the model's confidence scores and the actual presence of hallucinations.
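As a toy illustration of the correlation analysis mentioned above, the snippet below correlates per-example confidence scores with binary hallucination labels using Pearson's r (which coincides with the point-biserial correlation for a binary variable); the scores and labels are invented values, not the paper's data or code.

```python
# Toy illustration of correlating model confidence with hallucination labels.
# The scores and labels below are made-up values.
from scipy.stats import pearsonr

# Per-example confidence that the output is hallucinated (hypothetical system).
confidence = [0.91, 0.15, 0.64, 0.08, 0.77, 0.33, 0.88, 0.21]
# Gold annotation: 1 = the output contains a hallucination, 0 = it does not.
has_hallucination = [1, 0, 1, 0, 1, 0, 1, 0]

# With a binary variable, Pearson's r equals the point-biserial correlation.
r, p_value = pearsonr(confidence, has_hallucination)
print(f"correlation = {r:.2f} (p = {p_value:.3f})")
```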
arXiv Detail & Related papers (2025-03-25T13:40:22Z)
- Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
Multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks. This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z)
- Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models [70.19081534515371]
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks.
However, they can generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences.
We propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers.
arXiv Detail & Related papers (2024-07-04T18:47:42Z)
- Quantity Matters: Towards Assessing and Mitigating Number Hallucination in Large Vision-Language Models [57.42800112251644]
We focus on a specific type of hallucination, number hallucination, referring to models incorrectly identifying the number of certain objects in pictures.
We devise a training approach aimed at improving consistency to reduce number hallucinations, which leads to an 8% enhancement in performance over direct finetuning methods.
arXiv Detail & Related papers (2024-03-03T02:31:11Z)
- MALTO at SemEval-2024 Task 6: Leveraging Synthetic Data for LLM Hallucination Detection [3.049887057143419]
In Natural Language Generation (NLG), contemporary Large Language Models (LLMs) face several challenges.
This often leads to neural networks exhibiting "hallucinations".
The SHROOM challenge focuses on automatically identifying these hallucinations in the generated text.
arXiv Detail & Related papers (2024-03-01T20:31:10Z)
- Unified Hallucination Detection for Multimodal Large Language Models [44.333451078750954]
Multimodal Large Language Models (MLLMs) are plagued by the critical issue of hallucination.
We present a novel meta-evaluation benchmark, MHaluBench, meticulously crafted to facilitate the evaluation of advancements in hallucination detection methods.
We unveil a novel unified multimodal hallucination detection framework, UNIHD, which leverages a suite of auxiliary tools to validate the occurrence of hallucinations robustly.
arXiv Detail & Related papers (2024-02-05T16:56:11Z)
- Towards Mitigating Hallucination in Large Language Models via Self-Reflection [63.2543947174318]
Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks including question-answering (QA) tasks.
This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets.
arXiv Detail & Related papers (2023-10-10T03:05:44Z)
- Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models [116.01843550398183]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks.
LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge.
arXiv Detail & Related papers (2023-09-03T16:56:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.