Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings
- URL: http://arxiv.org/abs/2601.18788v1
- Date: Mon, 26 Jan 2026 18:54:34 GMT
- Title: Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings
- Authors: Mumin Jia, Jairo Diaz-Rodriguez
- Abstract summary: Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
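Since the abstract describes the pipeline but not an implementation, the following is a minimal sketch of the embed-then-detect recipe, assuming the off-the-shelf `sentence-transformers` and `ruptures` libraries (both choices are assumptions, not the authors' code). In the standard kernel change-point formulation, a segmentation $\tau = (\tau_1 < \dots < \tau_K)$ is chosen by minimizing the within-segment scatter in the reproducing kernel Hilbert space plus a penalty on the number of segments:

$$\hat{\tau} \in \arg\min_{K,\,\tau} \; \sum_{k=0}^{K} \sum_{i=\tau_k+1}^{\tau_{k+1}} \left\| \phi(x_i) - \hat{\mu}_k \right\|_{\mathcal{H}}^2 \; + \; \lambda \, \mathrm{pen}(K),$$

where $\phi$ is the kernel feature map, $\hat{\mu}_k$ is the empirical RKHS mean of segment $k$, and $\lambda$ weights the penalty.

```python
# Minimal sketch of the embed-then-detect recipe (library choices are
# assumptions; the paper does not specify an implementation).
import ruptures as rpt
from sentence_transformers import SentenceTransformer

sentences = [
    "The meeting covered the quarterly budget.",
    "Revenue grew faster than projected.",
    "After lunch we switched to hiring plans.",
    "Two new engineering roles were approved.",
]

# 1. Represent each sentence as an embedding vector.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
X = encoder.encode(sentences)                      # shape: (n_sentences, dim)

# 2. Minimize a penalized kernel change-point objective.
#    KernelCPD with an RBF kernel; `pen` is the penalty weight (lambda above)
#    trading goodness of fit against the number of detected boundaries.
algo = rpt.KernelCPD(kernel="rbf").fit(X)
boundaries = algo.predict(pen=1.0)

# `boundaries` lists the end index of each segment; the final entry is
# always len(sentences), e.g. [2, 4] -> segments {0,1} and {2,3}.
print(boundaries)
```

The `pen` argument plays the role of $\lambda$ above: larger values yield fewer, longer segments, which is the main knob to tune when ground-truth boundaries are unavailable.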
Related papers
- Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes [10.877713536966601]
The Longest Stable Prefix (LSP) scheduler is a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. LSP evaluates token stability via a single forward pass and dynamically identifies a contiguous, left-aligned block of stable predictions. It snaps this block's boundary to natural linguistic or structural acceptance points before an atomic commitment.
arXiv Detail & Related papers (2026-03-05T18:25:26Z) - Autonomous Chain-of-Thought Distillation for Graph-Based Fraud Detection [73.9189065770752]
Graph-based fraud detection on text-attributed graphs (TAGs) requires jointly modeling rich textual semantics and relational dependencies. We propose FraudCoT, a unified framework that advances TAG-based fraud detection through autonomous, graph-aware chain-of-thought (CoT) reasoning and scalable LLM-GNN co-training.
arXiv Detail & Related papers (2026-01-30T13:12:12Z) - Codebook-Injected Dialogue Segmentation for Multi-Utterance Constructs Annotation: LLM-Assisted and Gold-Label-Free Evaluation [0.17240671897505613]
Dialogue Act (DA) annotation treats communicative or pedagogical intent as localized to individual utterances or turns. We propose codebook-injected segmentation, which conditions boundary decisions on downstream annotation criteria. We find DA-awareness produces segments that are internally more consistent than text-only baselines.
arXiv Detail & Related papers (2026-01-17T14:17:13Z) - Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding. We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
arXiv Detail & Related papers (2025-12-26T07:53:41Z) - On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference [71.09125259964684]
Test-time compute (TTC) has become an increasingly prominent paradigm for enhancing large language models (LLMs). We study reward-filtered sequential inference, a simple procedure that selectively incorporates only high-reward generations into the context. On the theoretical side, we show that reward-filtered sequential inference yields strictly stronger guarantees than standard TTC paradigms.
arXiv Detail & Related papers (2025-12-04T08:21:33Z) - Consistent Kernel Change-Point Detection under m-Dependence for Text Segmentation [0.0]
Kernel change-point detection (KCPD) has become a widely used tool for identifying structural changes in complex data. We prove consistency in the number of detected change points and weak consistency in locations under mild additional assumptions. We present the first empirical study of KCPD for text segmentation with modern embeddings.
arXiv Detail & Related papers (2025-10-03T18:57:22Z) - Implicit Reasoning in Large Language Models: A Comprehensive Survey [67.53966514728383]
Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks. Recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning. This survey introduces a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies.
arXiv Detail & Related papers (2025-09-02T14:16:02Z) - READER: Retrieval-Assisted Drafter for Efficient LLM Inference [0.0386965802948046]
Autoregressive language models instantiate a factorized likelihood over token sequences, yet their strictly sequential decoding process imposes an intrinsic lower bound on inference latency. This bottleneck has emerged as a central obstacle to the scalable deployment of large-scale generative models. We present READER, a speculative decoding framework that bypasses the training of the auxiliary draft model.
arXiv Detail & Related papers (2025-08-12T16:47:48Z) - Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating the constraint on every token can be prohibitively expensive. Locally constrained decoding (LCD) can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition [56.968108142307976]
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training.
Most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models.
This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects.
arXiv Detail & Related papers (2024-02-24T13:00:54Z) - Neighbor-Aware Calibration of Segmentation Networks with Penalty-Based Constraints [19.897181782914437]
We propose a principled and simple solution based on equality constraints on the logit values, which enables explicit control over both the enforced constraint and the weight of the penalty.
Our approach can be used to train a wide span of deep segmentation networks.
arXiv Detail & Related papers (2024-01-25T19:46:57Z) - Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)