HalluSearch at SemEval-2025 Task 3: A Search-Enhanced RAG Pipeline for Hallucination Detection
- URL: http://arxiv.org/abs/2504.10168v1
- Date: Mon, 14 Apr 2025 12:22:30 GMT
- Title: HalluSearch at SemEval-2025 Task 3: A Search-Enhanced RAG Pipeline for Hallucination Detection
- Authors: Mohamed A. Abdallah, Samhaa R. El-Beltagy
- Abstract summary: HalluSearch is a pipeline designed to detect fabricated text spans in Large Language Model (LLM) outputs. It couples retrieval-augmented verification with fine-grained factual splitting to identify and localize hallucinations in 14 different languages. Empirical evaluations show that HalluSearch performs competitively, placing fourth in both English (within the top ten percent) and Czech.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present HalluSearch, a multilingual pipeline designed to detect fabricated text spans in Large Language Model (LLM) outputs. Developed as part of Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, HalluSearch couples retrieval-augmented verification with fine-grained factual splitting to identify and localize hallucinations in fourteen different languages. Empirical evaluations show that HalluSearch performs competitively, placing fourth in both English (within the top ten percent) and Czech. While the system's retrieval-based strategy generally proves robust, it faces challenges in languages with limited online coverage, underscoring the need for further research to ensure consistent hallucination detection across diverse linguistic contexts.
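The abstract describes two coupled steps: splitting an LLM output into fine-grained factual units, then verifying each unit against retrieved evidence so that unsupported units can be localized as character spans. The following is only a minimal sketch of that general idea, not the authors' implementation. The sentence-level splitter, the function names `split_into_claims` and `detect_hallucinated_spans`, and the `is_supported` callback (which stands in for the search and verification stage) are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # character offsets, end-exclusive


def split_into_claims(text: str) -> List[Span]:
    """Split text into sentence-like units and return their character spans.

    Assumption: a naive punctuation-based splitter. The paper's factual
    splitting is described as finer-grained than this.
    """
    spans: List[Span] = []
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        segment = match.group(0)
        stripped = segment.strip()
        if not stripped:
            continue
        # Skip leading whitespace so the span covers only the claim itself.
        lead = len(segment) - len(segment.lstrip())
        start = match.start() + lead
        spans.append((start, start + len(stripped)))
    return spans


def detect_hallucinated_spans(
    text: str, is_supported: Callable[[str], bool]
) -> List[Span]:
    """Return spans of claims that the verifier reports as unsupported.

    `is_supported` is a hypothetical hook: in a retrieval-based system it
    would query a search backend and compare the claim with the evidence.
    """
    return [
        (start, end)
        for start, end in split_into_claims(text)
        if not is_supported(text[start:end])
    ]


if __name__ == "__main__":
    # Toy verifier: a fixed set of "known" statements replaces real retrieval.
    known = {"Paris is the capital of France."}
    text = "Paris is the capital of France. The Eiffel Tower was built in 1789."
    for start, end in detect_hallucinated_spans(text, lambda c: c in known):
        print(start, end, text[start:end])
```

Returning character offsets rather than the claim strings keeps the output in the span-labeling form that the shared task uses, and passing the verifier as a callback lets the retrieval backend be swapped without touching the splitting logic.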
Related papers
- SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes [72.61348252096413]
We present the Mu-SHROOM shared task, which focuses on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs).
Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task.
We received 2,618 submissions from 43 participating teams employing diverse methodologies.
arXiv Detail & Related papers (2025-04-16T11:15:26Z)
- AILS-NTUA at SemEval-2025 Task 3: Leveraging Large Language Models and Translation Strategies for Multilingual Hallucination Detection [4.8858843645116945]
We propose an efficient, training-free LLM prompting strategy that enhances hallucination detection by translating multilingual text spans into English. Our approach achieves competitive rankings across multiple languages, securing two first positions in low-resource languages.
arXiv Detail & Related papers (2025-03-04T09:38:57Z)
- Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness [30.00463676754559]
We introduce BordIRLines, a benchmark consisting of 720 territorial dispute queries paired with 14k Wikipedia documents across 49 languages. Our experiments reveal that retrieving multilingual documents best improves response consistency and decreases geopolitical bias over using purely in-language documents. Our further experiments and case studies investigate how cross-lingual RAG is affected by aspects from IR to document contents.
arXiv Detail & Related papers (2024-10-02T01:59:07Z)
- Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models [70.19081534515371]
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks.
They generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences.
We propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers.
arXiv Detail & Related papers (2024-07-04T18:47:42Z)
- SLPL SHROOM at SemEval2024 Task 06: A comprehensive study on models ability to detect hallucination [1.4705596514165422]
This study explores methods for detecting hallucinations in three SemEval-2024 Task 6 tasks: Machine Translation, Definition Modeling, and Paraphrase Generation.
We evaluate two methods: semantic similarity between the generated text and factual references, and an ensemble of language models that judge each other's outputs.
arXiv Detail & Related papers (2024-04-07T07:34:49Z)
- KnowHalu: Hallucination Detection via Multi-Form Knowledge Based Factual Checking [55.2155025063668]
KnowHalu is a novel approach for detecting hallucinations in text generated by large language models (LLMs).
It uses step-wise reasoning, multi-formulation query, multi-form knowledge for factual checking, and fusion-based detection mechanism.
Our evaluations demonstrate that KnowHalu significantly outperforms SOTA baselines in detecting hallucinations across diverse tasks.
arXiv Detail & Related papers (2024-04-03T02:52:07Z)
- Comparing Hallucination Detection Metrics for Multilingual Generation [62.97224994631494]
This paper assesses how well various factual hallucination detection metrics identify hallucinations in generated biographical summaries across languages.
We compare how well automatic metrics correlate to each other and whether they agree with human judgments of factuality.
Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models.
arXiv Detail & Related papers (2024-02-16T08:10:34Z)
- Hallucinations in Large Multilingual Translation Models [70.10455226752015]
Large-scale multilingual machine translation systems have demonstrated remarkable ability to translate directly between numerous languages.
When deployed in the wild, these models may generate hallucinated translations which have the potential to severely undermine user trust and raise safety concerns.
Existing research on hallucinations has primarily focused on small bilingual models trained on high-resource languages.
arXiv Detail & Related papers (2023-03-28T16:17:59Z)
- Matching Tweets With Applicable Fact-Checks Across Languages [27.762055254009017]
We focus on automatically finding existing fact-checks for claims made in social media posts (tweets).
We conduct both classification and retrieval experiments, in monolingual (English only), multilingual (Spanish, Portuguese), and cross-lingual (Hindi-English) settings.
We present promising results for "match" classification (93% average accuracy) in four language pairs.
arXiv Detail & Related papers (2022-02-14T23:33:02Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)