Related papers: Span-level Detection of AI-generated Scientific Text via Contrastive Learning and Structural Calibration

URL: http://arxiv.org/abs/2510.00890v1
Date: Wed, 01 Oct 2025 13:35:14 GMT
Title: Span-level Detection of AI-generated Scientific Text via Contrastive Learning and Structural Calibration
Authors: Zhen Yin, Shenghua Wang
Abstract summary: Sci-SpanDet is a structure-aware framework for detecting AI-generated scholarly texts. It combines section-conditioned stylistic modeling with multi-level contrastive learning to capture nuanced human-AI differences. It achieves state-of-the-art performance, with F1(AI) of 80.17, AUROC of 92.63, and Span-F1 of 74.36.
Score: 2.105564340986074
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid adoption of large language models (LLMs) in scientific writing raises serious concerns regarding authorship integrity and the reliability of scholarly publications. Existing detection approaches mainly rely on document-level classification or surface-level statistical cues; however, they neglect fine-grained span localization, exhibit weak calibration, and often fail to generalize across disciplines and generators. To address these limitations, we present Sci-SpanDet, a structure-aware framework for detecting AI-generated scholarly texts. The proposed method combines section-conditioned stylistic modeling with multi-level contrastive learning to capture nuanced human-AI differences while mitigating topic dependence, thereby enhancing cross-domain robustness. In addition, it integrates BIO-CRF sequence labeling with pointer-based boundary decoding and confidence calibration to enable precise span-level detection and reliable probability estimates. Extensive experiments on a newly constructed cross-disciplinary dataset of 100,000 annotated samples generated by multiple LLM families (GPT, Qwen, DeepSeek, LLaMA) demonstrate that Sci-SpanDet achieves state-of-the-art performance, with F1(AI) of 80.17, AUROC of 92.63, and Span-F1 of 74.36. Furthermore, it shows strong resilience under adversarial rewriting and maintains balanced accuracy across IMRaD sections and diverse disciplines, substantially surpassing existing baselines. To ensure reproducibility and to foster further research on AI-generated text detection in scholarly documents, the curated dataset and source code will be publicly released upon publication.
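The abstract mentions BIO sequence labeling and a Span-F1 metric but does not spell out either. The following is a minimal sketch of how BIO tags can be collapsed into spans and scored, under stated assumptions: the tag set is the plain "B"/"I"/"O" alphabet, spans are half-open token ranges, and Span-F1 is computed by exact span match. The paper's actual matching criterion, its CRF layer, and its pointer-based boundary decoding are not reproduced here; the function names `bio_to_spans` and `span_f1` are hypothetical.

```python
from typing import List, Optional, Tuple

# A span is a half-open token range (start, end). This convention is an
# assumption made for the sketch, not something stated in the abstract.
Span = Tuple[int, int]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collapse a B/I/O tag sequence into spans of AI-generated tokens."""
    spans: List[Span] = []
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new span begins; close any span that is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":
            # Tolerate an I with no preceding B by opening a span here.
            if start is None:
                start = i
        else:  # "O" closes any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def span_f1(gold: List[Span], pred: List[Span]) -> float:
    """Exact-match span F1 (assumed definition; returns 0.0 with no matches)."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold_tags = ["O", "B", "I", "I", "O", "B", "I"]
    pred_tags = ["O", "B", "I", "I", "O", "O", "B"]
    gold_spans = bio_to_spans(gold_tags)
    pred_spans = bio_to_spans(pred_tags)
    print(gold_spans, pred_spans, span_f1(gold_spans, pred_spans))
```

In this toy example one of two predicted spans matches one of two gold spans exactly, so precision and recall are both 0.5. An overlap-based criterion would score the second prediction as a partial match, which is why the matching rule matters when comparing reported Span-F1 values.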

Related papers

Fine-Grained Human Pose Editing Assessment via Layer-Selective MLLMs [70.31435391393642]
We introduce a benchmark comprising 1,700 standardized samples from 17 state-of-the-art editing models. We propose a unified framework based on layer-selective multimodal large language models (MLLMs). Our framework achieves superior performance in both authenticity detection and multi-dimensional quality regression.
arXiv Detail & Related papers (2026-01-15T13:22:07Z)
A Theoretically Grounded Hybrid Ensemble for Reliable Detection of LLM-Generated Text [0.0]
We propose a theoretically grounded hybrid ensemble that fuses three complementary detection paradigms. The core novelty lies in an optimized weighted voting framework, where ensemble weights are learned on the probability simplex to maximize F1-score. Our system achieves 94.2% accuracy and an AUC of 0.978, with a 35% relative reduction in false positives on academic text.
arXiv Detail & Related papers (2025-11-27T06:42:56Z)
Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection [71.59834293521074]
We develop a framework to distinguish between human-authored and machine-generated text. Our method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on DeepFake dataset. Code, pretrained weights, and demo will be released.
arXiv Detail & Related papers (2025-10-07T08:14:45Z)
Diversity Boosts AI-Generated Text Detection [51.56484100374058]
DivEye is a novel framework that captures how unpredictability fluctuates across a text using surprisal-based features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines.
arXiv Detail & Related papers (2025-09-23T10:21:22Z)
Fine-Grained Detection of AI-Generated Text Using Sentence-Level Segmentation [3.088244520495001]
A sentence-level sequence labeling model is proposed to detect transitions between human- and AI-generated text. Our model combines state-of-the-art pre-trained Transformer models, incorporating Neural Networks (NN) and Conditional Random Fields (CRFs). The evaluation is performed on two publicly available benchmark datasets containing collaborative human and AI-generated texts.
arXiv Detail & Related papers (2025-09-22T14:22:55Z)
DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models [60.713908578319256]
We propose Direct Discrepancy Learning (DDL) to optimize the detector with task-oriented knowledge. Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance. MIRAGE samples human-written texts from 10 corpora across 5 text-domains, which are then re-generated or revised using 17 cutting-edge LLMs.
arXiv Detail & Related papers (2025-09-15T10:59:57Z)
HySemRAG: A Hybrid Semantic Retrieval-Augmented Generation Framework for Automated Literature Synthesis and Methodological Gap Analysis [55.2480439325792]
HySemRAG is a framework that combines Extract, Transform, Load (ETL) pipelines with Retrieval-Augmented Generation (RAG). The system addresses limitations in existing RAG architectures through a multi-layered approach.
arXiv Detail & Related papers (2025-08-01T20:30:42Z)
HACo-Det: A Study Towards Fine-Grained Machine-Generated Text Detection under Human-AI Coauthoring [14.887491317701997]
This paper explores the possibility of fine-grained MGT detection under human-AI coauthoring. We suggest fine-grained detectors can pave pathways toward coauthored text detection with a numeric AI ratio. Empirical results show that metric-based methods struggle to conduct fine-grained detection with a 0.462 average F1 score.
arXiv Detail & Related papers (2025-06-03T14:52:44Z)
When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research [19.97666809905332]
Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists.
arXiv Detail & Related papers (2025-05-17T05:45:16Z)
Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation [58.85645136534301]
Existing watermarking schemes for sampled text often face trade-offs between maintaining text quality and ensuring robust detection against various attacks. We propose a novel watermarking scheme that improves both detectability and text quality by introducing a cumulative watermark entropy threshold.
arXiv Detail & Related papers (2025-04-16T14:16:38Z)
Unveiling Large Language Models Generated Texts: A Multi-Level Fine-Grained Detection Framework [9.976099891796784]
Large language models (LLMs) have transformed human writing by enhancing grammar correction, content expansion, and stylistic refinement. Existing detection methods, which mainly rely on single-feature analysis and binary classification, often fail to effectively identify LLM-generated text in academic contexts. We propose a novel Multi-level Fine-grained Detection framework that detects LLM-generated text by integrating low-level structural, high-level semantic, and deep-level linguistic features.
arXiv Detail & Related papers (2024-10-18T07:25:00Z)
On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approximates human-like quality, the sample size needed for detection bounds increases. We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
Text Recognition in Real Scenarios with a Few Labeled Samples [55.07859517380136]
Scene text recognition (STR) is still a hot research topic in the computer vision field. This paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to build sequence adaptation. Our approach can maximize the character-level confusion between the source domain and the target domain.
arXiv Detail & Related papers (2020-06-22T13:03:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.