Related papers: SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection

URL: http://arxiv.org/abs/2404.03732v1
Date: Thu, 4 Apr 2024 18:01:21 GMT
Title: SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection
Authors: Bradley P. Allen, Fina Polat, Paul Groth
Abstract summary: The SHROOM-INDElab system builds on previous work on using prompt programming and in-context learning to build classifiers for hallucination detection. It extends that work by incorporating context-specific definitions of the task, role, and target concept, and by automatically generating examples for few-shot prompting. The resulting system achieved fourth-best and sixth-best performance in the model-agnostic and model-aware tracks of Task 6, respectively.
Score: 1.3886978730184498
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: We describe the University of Amsterdam Intelligent Data Engineering Lab team's entry for the SemEval-2024 Task 6 competition. The SHROOM-INDElab system builds on previous work on using prompt programming and in-context learning with large language models (LLMs) to build classifiers for hallucination detection, and extends that work through the incorporation of context-specific definition of task, role, and target concept, and automated generation of examples for use in a few-shot prompting approach. The resulting system achieved fourth-best and sixth-best performance in the model-agnostic and model-aware tracks for Task 6, respectively, and evaluation using the validation sets showed that the system's classification decisions were consistent with those of the crowd-sourced human labellers. We further found that a zero-shot approach provided better accuracy than a few-shot approach using automatically generated examples. Code for the system described in this paper is available on GitHub.

Related papers

Few-Shot Optimized Framework for Hallucination Detection in Resource-Limited NLP Systems [1.0124625066746595]
We introduce DeepSeek Few-shot optimization to enhance weak label generation through iterative prompt engineering. We achieve high-quality annotations that considerably enhance the performance of downstream models. We further fine-tune the Mistral-7B-Instruct-v0.3 model on these optimized annotations, enabling it to accurately detect hallucinations in resource-limited settings.
arXiv Detail & Related papers (2025-01-28T01:26:22Z)
Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset. We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
Instructive Code Retriever: Learn from Large Language Model's Feedback for Code Intelligence Tasks [10.867880635762395]
We introduce a novel approach named Instructive Code Retriever (ICR). ICR is designed to retrieve examples that enhance model inference across various code intelligence tasks and datasets. We evaluate our model's effectiveness on various tasks, i.e., code summarization, program synthesis, and bug fixing.
arXiv Detail & Related papers (2024-10-15T05:44:00Z)
The OCON model: an old but gold solution for distributable supervised classification [0.28675177318965045]
This paper introduces a structured application of the One-Class approach and the One-Class-One-Network model for supervised classification tasks. We achieve classification accuracy comparable to that of today's complex architectures (90.0 - 93.7%).
arXiv Detail & Related papers (2024-10-05T09:15:01Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
SmurfCat at SemEval-2024 Task 6: Leveraging Synthetic Data for Hallucination Detection [51.99159169107426]
We present our novel systems developed for the SemEval-2024 hallucination detection task. Our investigation spans a range of strategies to compare model predictions with reference standards. We introduce three distinct methods that exhibit strong performance metrics.
arXiv Detail & Related papers (2024-04-09T09:03:44Z)
AISPACE at SemEval-2024 task 8: A Class-balanced Soft-voting System for Detecting Multi-generator Machine-generated Text [0.0]
SemEval-2024 Task 8 provides a challenge to detect human-written and machine-generated text. This paper proposes a system that mainly deals with Subtask B, which aims to detect whether a given full text is written by a human or generated by a specific Large Language Model (LLM). This is, in effect, a multi-class text classification task.
arXiv Detail & Related papers (2024-04-01T06:25:47Z)
IUST_NLP at SemEval-2023 Task 10: Explainable Detecting Sexism with Transformers and Task-adaptive Pretraining [0.0]
This paper describes our system for SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS). We propose a set of transformer-based pre-trained models with task-adaptive pretraining and ensemble learning. On the test dataset, our system achieves F1-scores of 83%, 64%, and 47% on subtasks A, B, and C, respectively.
arXiv Detail & Related papers (2023-05-11T15:29:04Z)
Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting. The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z)
Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints. This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks. Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
RethinkCWS: Is Chinese Word Segmentation a Solved Task? [81.11161697133095]
The performance of Chinese Word Segmentation (CWS) systems has gradually reached a plateau with the rapid development of deep neural networks. In this paper, we take stock of what we have achieved and rethink what's left in the CWS task.
arXiv Detail & Related papers (2020-11-13T11:07:08Z)
Yseop at SemEval-2020 Task 5: Cascaded BERT Language Model for Counterfactual Statement Analysis [0.0]
We use a BERT base model for the classification task and build a hybrid BERT Multi-Layer Perceptron system to handle the sequence identification task. Our experiments show that while introducing syntactic and semantic features does little to improve the system on the classification task, using these features as cascaded linear inputs to fine-tune the sequence-delimiting ability of the model ensures that it outperforms other similar-purpose complex systems, such as BiLSTM-CRF, on the second task.
arXiv Detail & Related papers (2020-05-18T08:19:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.