Span-level Detection of AI-generated Scientific Text via Contrastive Learning and Structural Calibration
- URL: http://arxiv.org/abs/2510.00890v1
- Date: Wed, 01 Oct 2025 13:35:14 GMT
- Title: Span-level Detection of AI-generated Scientific Text via Contrastive Learning and Structural Calibration
- Authors: Zhen Yin, Shenghua Wang,
- Abstract summary: Sci-SpanDet is a structure-aware framework for detecting AI-generated scholarly texts.<n>It combines section-conditioned stylistic modeling with multi-level contrastive learning to capture human nuanced-AI differences.<n>It achieves state-of-the-art performance, with F1(AI) of 80.17, AUROC of 92.63, and Span-F1 of 74.36.
- Score: 2.105564340986074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid adoption of large language models (LLMs) in scientific writing raises serious concerns regarding authorship integrity and the reliability of scholarly publications. Existing detection approaches mainly rely on document-level classification or surface-level statistical cues; however, they neglect fine-grained span localization, exhibit weak calibration, and often fail to generalize across disciplines and generators. To address these limitations, we present Sci-SpanDet, a structure-aware framework for detecting AI-generated scholarly texts. The proposed method combines section-conditioned stylistic modeling with multi-level contrastive learning to capture nuanced human-AI differences while mitigating topic dependence, thereby enhancing cross-domain robustness. In addition, it integrates BIO-CRF sequence labeling with pointer-based boundary decoding and confidence calibration to enable precise span-level detection and reliable probability estimates. Extensive experiments on a newly constructed cross-disciplinary dataset of 100,000 annotated samples generated by multiple LLM families (GPT, Qwen, DeepSeek, LLaMA) demonstrate that Sci-SpanDet achieves state-of-the-art performance, with F1(AI) of 80.17, AUROC of 92.63, and Span-F1 of 74.36. Furthermore, it shows strong resilience under adversarial rewriting and maintains balanced accuracy across IMRaD sections and diverse disciplines, substantially surpassing existing baselines. To ensure reproducibility and to foster further research on AI-generated text detection in scholarly documents, the curated dataset and source code will be publicly released upon publication.
Related papers
- Fine-Grained Human Pose Editing Assessment via Layer-Selective MLLMs [70.31435391393642]
We introduce a benchmark comprising 1,700 standardized samples from 17 state-of-the-art editing models.<n>We propose a unified framework based on layer-selective multimodal large language models (MLLMs)<n>Our framework achieves superior performance in both authenticity detection and multi-dimensional quality regression.
arXiv Detail & Related papers (2026-01-15T13:22:07Z) - A Theoretically Grounded Hybrid Ensemble for Reliable Detection of LLM-Generated Text [0.0]
We propose a theoretically grounded hybrid ensemble that fuses three complementary detection paradigms.<n>The core novelty lies in an optimized weighted voting framework, where ensemble weights are learned on the probability simplex to maximize F1-score.<n>Our system achieves 94.2% accuracy and an AUC of 0.978, with a 35% relative reduction in false positives on academic text.
arXiv Detail & Related papers (2025-11-27T06:42:56Z) - Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection [71.59834293521074]
We develop a framework to distinguish between human-authored and machine-generated text.<n>Our method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on DeepFake dataset.<n>Code, pretrained weights, and demo will be released.
arXiv Detail & Related papers (2025-10-07T08:14:45Z) - Diversity Boosts AI-Generated Text Detection [51.56484100374058]
DivEye is a novel framework that captures how unpredictability fluctuates across a text using surprisal-based features.<n>Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines.
arXiv Detail & Related papers (2025-09-23T10:21:22Z) - Fine-Grained Detection of AI-Generated Text Using Sentence-Level Segmentation [3.088244520495001]
A sentence-level sequence labeling model proposed to detect transitions between human- and AI-generated text.<n>Our model combines the state-of-the-art pre-trained Transformer models, incorporating Neural Networks (NN) and Conditional Random Fields (CRFs)<n>The evaluation is performed on two publicly available benchmark datasets containing collaborative human and AI-generated texts.
arXiv Detail & Related papers (2025-09-22T14:22:55Z) - DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models [60.713908578319256]
We propose Direct Discrepancy Learning (DDL) to optimize the detector with task-oriented knowledge.<n>Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance.<n>MIRAGE samples human-written texts from 10 corpora across 5 text-domains, which are then re-generated or revised using 17 cutting-edge LLMs.
arXiv Detail & Related papers (2025-09-15T10:59:57Z) - HySemRAG: A Hybrid Semantic Retrieval-Augmented Generation Framework for Automated Literature Synthesis and Methodological Gap Analysis [55.2480439325792]
HySemRAG is a framework that combines Extract, Transform, Load (ETL) pipelines with Retrieval-Augmented Generation (RAG)<n>System addresses limitations in existing RAG architectures through a multi-layered approach.
arXiv Detail & Related papers (2025-08-01T20:30:42Z) - HACo-Det: A Study Towards Fine-Grained Machine-Generated Text Detection under Human-AI Coauthoring [14.887491317701997]
This paper explores the possibility of fine-grained MGT detection under human-AI coauthoring.<n>We suggest fine-grained detectors can pave pathways toward coauthored text detection with a numeric AI ratio.<n> Empirical results show that metric-based methods struggle to conduct fine-grained detection with a 0.462 average F1 score.
arXiv Detail & Related papers (2025-06-03T14:52:44Z) - When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research [19.97666809905332]
Large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists.<n>Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists.
arXiv Detail & Related papers (2025-05-17T05:45:16Z) - Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation [58.85645136534301]
Existing watermarking schemes for sampled text often face trade-offs between maintaining text quality and ensuring robust detection against various attacks.<n>We propose a novel watermarking scheme that improves both detectability and text quality by introducing a cumulative watermark entropy threshold.
arXiv Detail & Related papers (2025-04-16T14:16:38Z) - Unveiling Large Language Models Generated Texts: A Multi-Level Fine-Grained Detection Framework [9.976099891796784]
Large language models (LLMs) have transformed human writing by enhancing grammar correction, content expansion, and stylistic refinement.
Existing detection methods, which mainly rely on single-feature analysis and binary classification, often fail to effectively identify LLM-generated text in academic contexts.
We propose a novel Multi-level Fine-grained Detection framework that detects LLM-generated text by integrating low-level structural, high-level semantic, and deep-level linguistic features.
arXiv Detail & Related papers (2024-10-18T07:25:00Z) - On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approximates human-like quality, the sample size needed for detection bounds increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including oBERTa-Large/Base-Detector, GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z) - Text Recognition in Real Scenarios with a Few Labeled Samples [55.07859517380136]
Scene text recognition (STR) is still a hot research topic in computer vision field.
This paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to build sequence adaptation.
Our approach can maximize the character-level confusion between the source domain and the target domain.
arXiv Detail & Related papers (2020-06-22T13:03:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.