BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation
- URL: http://arxiv.org/abs/2510.20151v1
- Date: Thu, 23 Oct 2025 02:56:10 GMT
- Title: BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation
- Authors: Haoyuan Li, Zhengyuan Shen, Sullam Jeoung, Yueyan Chen, Jiayu Li, Qi Zhu, Shuai Wang, Vassilis Ioannidis, Huzefa Rangwala,
- Abstract summary: BoundRL performs token-level text segmentation and label prediction for long structured texts.<n>Instead of generating complete contents for each segment, it generates only a sequence of starting tokens.<n>It reconstructs the complete contents by locating these tokens within the original texts.
- Score: 26.825801831400003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As structured texts become increasingly complex across diverse domains -- from technical reports to generative AI prompts -- the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL~performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
Related papers
- HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition [4.5311655360445515]
Existing approaches, though partially address these issues, often struggle to generalize without massive synthetic data.<n>We propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies.<n>We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure.
arXiv Detail & Related papers (2025-12-04T17:35:05Z) - CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning [67.18702329644526]
CoT Referring enhances model reasoning across modalities through a structured, chain-of-thought training data structure.<n>We restructure the training data to enforce a new output form, providing new annotations for existing datasets.<n>We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance.
arXiv Detail & Related papers (2025-10-03T08:50:21Z) - Text4Seg++: Advancing Image Segmentation via Generative Language Modeling [52.07442359419673]
We propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem.<n>Key innovation is semantic descriptors, a new textual representation of segmentation masks.<n>Experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models.
arXiv Detail & Related papers (2025-09-08T04:07:14Z) - Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning [3.9914181590063884]
Large Language Models (LLMs) have become a cornerstone in Natural Language Processing (NLP)<n>We explore several adaptation strategies for pre-trained, decoder-only LLMs.
arXiv Detail & Related papers (2025-07-30T14:49:30Z) - Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking [0.9968037829925942]
This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering.<n>During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations.<n> Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.
arXiv Detail & Related papers (2025-07-14T05:21:58Z) - Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception [10.614437503578856]
This paper proposes the Meta-Chunking framework, which specifically enhances chunking quality.<n>We design two adaptive chunking techniques based on uncertainty, namely Perplexity Chunking and Margin Sampling Chunking.<n>We establish the global information compensation mechanism, encompassing a two-stage hierarchical summary generation process and a three-stage text chunk rewriting procedure.
arXiv Detail & Related papers (2024-10-16T17:59:32Z) - From Text Segmentation to Smart Chaptering: A Novel Benchmark for
Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Composable Text Controls in Latent Space with ODEs [97.12426987887021]
This paper proposes a new efficient approach for composable text operations in the compact latent space of text.
By connecting pretrained LMs to the latent space through efficient adaption, we then decode the sampled vectors into desired text sequences.
Experiments show that composing those operators within our approach manages to generate or edit high-quality text.
arXiv Detail & Related papers (2022-08-01T06:51:45Z) - Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical
Supervision from Extractive Summaries [46.183289748907804]
We propose SOE, a pipelined system that outlines, outlining and elaborating for long text generation.
SOE produces long texts with significantly better quality, along with faster convergence speed.
arXiv Detail & Related papers (2020-10-14T13:22:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.