Consistent Kernel Change-Point Detection under m-Dependence for Text Segmentation
- URL: http://arxiv.org/abs/2510.03437v1
- Date: Fri, 03 Oct 2025 18:57:22 GMT
- Title: Consistent Kernel Change-Point Detection under m-Dependence for Text Segmentation
- Authors: Jairo Diaz-Rodriguez, Mumin Jia
- Abstract summary: Kernel change-point detection (KCPD) has become a widely used tool for identifying structural changes in complex data. We prove consistency in the number of detected change points and weak consistency in their locations under mild additional assumptions. We present the first empirical study of KCPD for text segmentation with modern embeddings.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Kernel change-point detection (KCPD) has become a widely used tool for identifying structural changes in complex data. While existing theory establishes consistency under independence assumptions, real-world sequential data such as text exhibits strong dependencies. We establish new guarantees for KCPD under $m$-dependent data: specifically, we prove consistency in the number of detected change points and weak consistency in their locations under mild additional assumptions. We perform an LLM-based simulation that generates synthetic $m$-dependent text to validate the asymptotics. To complement these results, we present the first comprehensive empirical study of KCPD for text segmentation with modern embeddings. Across diverse text datasets, KCPD with text embeddings outperforms baselines in standard text segmentation metrics. We demonstrate through a case study on Taylor Swift's tweets that KCPD not only provides strong theoretical and simulated reliability but also practical effectiveness for text segmentation tasks.
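The abstract's core idea, detecting a distributional change via a kernel discrepancy between segments, can be illustrated with a minimal, self-contained sketch. This is not the paper's estimator; it is a single-change-point illustration that picks the split maximizing a biased MMD² between the left and right segments under an RBF kernel. The function names, the `gamma` bandwidth, and the `min_seg` guard are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    # Gram matrix of the RBF kernel between the rows of x (a, d) and y (b, d).
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def single_changepoint_mmd(X, gamma=0.5, min_seg=5):
    """Return the split index t maximizing the biased MMD^2 between
    X[:t] and X[t:], along with the achieved score. X has shape (n, d),
    e.g. n sentence embeddings of dimension d."""
    n = X.shape[0]
    best_t, best_score = None, -np.inf
    for t in range(min_seg, n - min_seg):
        A, B = X[:t], X[t:]
        # Biased MMD^2 estimate: within-segment similarity minus cross-segment similarity.
        mmd2 = (rbf_kernel(A, A, gamma).mean()
                + rbf_kernel(B, B, gamma).mean()
                - 2.0 * rbf_kernel(A, B, gamma).mean())
        if mmd2 > best_score:
            best_t, best_score = t, mmd2
    return best_t, best_score
```

On a sequence with a mean shift (for instance, 50 draws from N(0, I) followed by 50 from N(2·1, I)), the returned split lands near the true boundary; the m-dependence analyzed in the paper concerns correlated observations, which this i.i.d. toy setup does not model.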
Related papers
- Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings [0.0]
Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
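A penalized kernel objective of the kind described above can be sketched with a small dynamic program: minimize the total within-segment kernel cost plus a linear per-segment penalty, which jointly selects the number and the locations of boundaries. This is an illustration under assumed choices (RBF kernel, penalty `pen`), not the Embed-KCPD implementation.

```python
import numpy as np

def segment_cost(K, a, b):
    # Kernel within-segment dispersion of X[a:b]:
    # trace of the sub-Gram matrix minus its normalized sum.
    sub = K[a:b, a:b]
    return np.trace(sub) - sub.sum() / (b - a)

def penalized_kcpd(X, pen, gamma=0.5):
    """Exact dynamic program over all segmentations of X (n, d):
    minimize sum of segment costs + pen * (number of segments).
    Returns the interior boundary indices, sorted."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)
    best = np.full(n + 1, np.inf)  # best[t] = optimal cost of segmenting X[:t]
    best[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        for s in range(t):
            c = best[s] + segment_cost(K, s, t) + pen
            if c < best[t]:
                best[t], prev[t] = c, s
    # Backtrack the optimal boundaries (excluding the endpoints 0 and n).
    bnds, t = [], n
    while t > 0:
        t = prev[t]
        if t > 0:
            bnds.append(t)
    return sorted(bnds)
```

Larger `pen` yields fewer segments; with a moderate penalty, a two-regime mean-shift sequence is recovered as a single boundary near the true split. The O(n²) search is fine for short documents, though practical libraries use faster pruned or kernel-specific recursions.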
arXiv Detail & Related papers (2026-01-26T18:54:34Z) - RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns [50.401907401444404]
Detecting text generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. We propose RepreGuard, an efficient statistics-based detection method. Experimental results show that RepreGuard outperforms all baselines with an average 94.92% AUROC in both in-distribution (ID) and OOD scenarios.
arXiv Detail & Related papers (2025-08-18T17:59:15Z) - Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition [56.968108142307976]
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training.
Most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models.
This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects.
arXiv Detail & Related papers (2024-02-24T13:00:54Z) - Statistical Depth for Ranking and Characterizing Transformer-Based Text Embeddings [1.321681963474017]
A statistical depth is a function for ranking k-dimensional objects by measuring centrality with respect to some observed k-dimensional distribution.
We adapt statistical depth to distributions of transformer-based text embeddings, termed transformer-based text embedding (TTE) depth, and introduce its practical use for both modeling and distributional inference in NLP pipelines.
arXiv Detail & Related papers (2023-10-23T15:02:44Z) - APPLS: Evaluating Evaluation Metrics for Plain Language Summarization [18.379461020500525]
This study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for Plain Language Summarization (PLS). We identify four PLS criteria from previous work and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. Using APPLS, we assess the performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations.
arXiv Detail & Related papers (2023-05-23T17:59:19Z) - Conditional Independence Testing via Latent Representation Learning [2.566492438263125]
LCIT (Latent representation based Conditional Independence Test) is a novel non-parametric method for conditional independence testing based on representation learning.
Our main contribution involves proposing a generative framework in which to test for the independence between X and Y given Z.
arXiv Detail & Related papers (2022-09-04T07:16:03Z) - Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion [62.269219152425556]
Segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field.
We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network.
An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
arXiv Detail & Related papers (2022-02-21T15:30:14Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - Conceptual Text Region Network: Cognition-Inspired Accurate Scene Text Detection [7.716899861923764]
We propose a human cognition-inspired framework, termed Conceptual Text Region Network (CTRNet).
CTRNet utilizes Conceptual Text Regions (CTRs), a class of cognition-based tools that inherit good mathematical properties and allow for sophisticated label design.
CTRNet achieves state-of-the-art performance on benchmark CTW1500, Total-Text, MSRA-TD500, and ICDAR 2015 datasets, yielding performance gains of up to 2.0%.
arXiv Detail & Related papers (2021-03-16T16:28:33Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z) - Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting [49.768327669098674]
We propose an end-to-end trainable text spotting approach named Text Perceptron.
It first employs an efficient segmentation-based text detector that learns the latent text reading order and boundary information.
Then a novel Shape Transform Module (abbr. STM) is designed to transform the detected feature regions into regular morphologies.
arXiv Detail & Related papers (2020-02-17T08:07:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.