SKDU at De-Factify 4.0: Natural Language Features for AI-Generated Text-Detection
- URL: http://arxiv.org/abs/2503.22338v1
- Date: Fri, 28 Mar 2025 11:25:05 GMT
- Title: SKDU at De-Factify 4.0: Natural Language Features for AI-Generated Text-Detection
- Authors: Shrikant Malviya, Pablo Arnau-González, Miguel Arevalillo-Herráez, Stamos Katsigiannis,
- Abstract summary: NELA features significantly outperform RAIDAR features in distinguishing human-written and AI-generated text.<n>XGBoost emerged as the most effective, leveraging the rich feature sets to achieve high accuracy and generalisation.
- Score: 2.006175707670159
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The rapid advancement of large language models (LLMs) has introduced new challenges in distinguishing human-written text from AI-generated content. In this work, we explored a pipelined approach for AI-generated text detection that includes a feature extraction step (i.e. prompt-based rewriting features inspired by RAIDAR and content-based features derived from the NELA toolkit) followed by a classification module. Comprehensive experiments were conducted on the Defactify4.0 dataset, evaluating two tasks: binary classification to differentiate human-written and AI-generated text, and multi-class classification to identify the specific generative model used to generate the input text. Our findings reveal that NELA features significantly outperform RAIDAR features in both tasks, demonstrating their ability to capture nuanced linguistic, stylistic, and content-based differences. Combining RAIDAR and NELA features provided minimal improvement, highlighting the redundancy introduced by less discriminative features. Among the classifiers tested, XGBoost emerged as the most effective, leveraging the rich feature sets to achieve high accuracy and generalisation.
Related papers
- LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification [63.07563443280147]
We propose a novel framework named LATex for AG-ReID.
It adopts prompt-tuning strategies to leverage attribute-based text knowledge.
Our framework can fully leverage attribute-based text knowledge to improve the AG-ReID.
arXiv Detail & Related papers (2025-03-31T04:47:05Z) - Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders [20.557610461777344]
We use Sparse Autoencoders (SAE) to extract features from Gemma-2-2b residual stream.<n>We identify both interpretable and efficient features, analyzing their semantics and relevance.<n>Our methods offer valuable insights into how texts from various models differ from human-written content.
arXiv Detail & Related papers (2025-03-05T15:33:52Z) - Sarang at DEFACTIFY 4.0: Detecting AI-Generated Text Using Noised Data and an Ensemble of DeBERTa Models [0.0]
This paper presents an effective approach to detect AI-generated text.<n>It was developed for the Defactify 4.0 shared task at the fourth workshop on multimodal fact checking and hate speech detection.<n>Our team (Sarang) achieved the 1st place in both tasks with F1 scores of 1.0 and 0.9531, respectively.
arXiv Detail & Related papers (2025-02-24T05:32:00Z) - Harnessing the Intrinsic Knowledge of Pretrained Language Models for Challenging Text Classification Settings [5.257719744958367]
This thesis explores three challenging settings in text classification by leveraging the intrinsic knowledge of pretrained language models (PLMs)
We develop models that utilize features based on contextualized word representations from PLMs, achieving performance that rivals or surpasses human accuracy.
Lastly, we tackle the sensitivity of large language models to in-context learning prompts by selecting effective demonstrations.
arXiv Detail & Related papers (2024-08-28T09:07:30Z) - ToBlend: Token-Level Blending With an Ensemble of LLMs to Attack AI-Generated Text Detection [6.27025292177391]
ToBlend is a novel token-level ensemble text generation method to challenge the robustness of current AI-content detection approaches.
We find ToBlend significantly drops the performance of most mainstream AI-content detection methods.
arXiv Detail & Related papers (2024-02-17T02:25:57Z) - Successor Features for Efficient Multisubject Controlled Text Generation [48.37713738712319]
We introduce SF-GEN, which is grounded in two primary concepts: successor features (SFs) and language model rectification.
SF-GEN seamlessly integrates the two to enable dynamic steering of text generation with no need to alter the LLM's parameters.
To the best of our knowledge, our research represents the first application of successor features in text generation.
arXiv Detail & Related papers (2023-11-03T00:17:08Z) - FacTool: Factuality Detection in Generative AI -- A Tool Augmented
Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models.
We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
arXiv Detail & Related papers (2023-07-25T14:20:51Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Controllable Text Generation with Focused Variation [71.07811310799664]
Focused-Variation Network (FVN) is a novel model to control language generation.
FVN learns disjoint discrete latent spaces for each attribute inside codebooks, which allows for both controllability and diversity.
We evaluate FVN on two text generation datasets with annotated content and style, and show state-of-the-art performance as assessed by automatic and human evaluations.
arXiv Detail & Related papers (2020-09-25T06:31:06Z) - Probing Linguistic Features of Sentence-Level Representations in Neural
Relation Extraction [80.38130122127882]
We introduce 14 probing tasks targeting linguistic properties relevant to neural relation extraction (RE)
We use them to study representations learned by more than 40 different encoder architecture and linguistic feature combinations trained on two datasets.
We find that the bias induced by the architecture and the inclusion of linguistic features are clearly expressed in the probing task performance.
arXiv Detail & Related papers (2020-04-17T09:17:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.