T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
- URL: http://arxiv.org/abs/2603.03790v1
- Date: Wed, 04 Mar 2026 07:05:09 GMT
- Title: T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
- Authors: Qinsi Wang, Hancheng Ye, Jinhee Kim, Jinghan Ke, Yifei Wang, Martin Kuo, Zishan Shao, Dongting Li, Yueqian Lin, Ting Jiang, Chiyue Wei, Qi Qian, Wei Wen, Helen Li, Yiran Chen
- Abstract summary: We introduce Structure of Thought (SoT), a prompting technique that guides models to construct intermediate text structures. Building upon this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve text-to-structure capabilities of models.
- Score: 31.85615810584119
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Think about how humans handle complex reading tasks: marking key points, inferring their relationships, and structuring information to guide understanding and responses. Likewise, can a large language model benefit from text structure to enhance text-processing performance? To explore this question, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures, consistently boosting performance across eight tasks and three model families. Building upon this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve text-to-structure capabilities of models. T2S-Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation on 45 mainstream models reveals substantial improvement potential: the average accuracy on the multi-hop reasoning task is only 52.1%, and even the most advanced model achieves only 58.1% node accuracy in end-to-end extraction. Furthermore, on Qwen2.5-7B-Instruct, SoT alone yields an average +5.7% improvement across eight diverse text-processing tasks, and fine-tuning on T2S-Bench further increases this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S-Bench. Dataset and eval code have been released at https://t2s-bench.github.io/T2S-Bench-Page/.
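The two-stage SoT idea described in the abstract can be illustrated with a minimal prompt-construction sketch. This is an assumption-laden illustration, not the authors' actual template: the prompt wording, the triple format, and the function names are all hypothetical, and the LLM response is stubbed rather than generated.

```python
# Minimal sketch of Structure-of-Thought (SoT) prompting. Stage 1 asks the
# model to extract an explicit intermediate structure (entity-relation
# triples) from the passage; stage 2 prepends that structure to the actual
# task prompt. Prompt wording here is illustrative only.

def build_structure_prompt(passage: str) -> str:
    """Stage 1: ask the model to turn the passage into an explicit structure."""
    return (
        "Read the passage and list its key entities and the relationships "
        "between them as 'A -[relation]-> B' triples.\n\n"
        f"Passage:\n{passage}\n\nStructure:"
    )

def build_task_prompt(passage: str, structure: str, question: str) -> str:
    """Stage 2: answer the question with the extracted structure as context."""
    return (
        f"Passage:\n{passage}\n\n"
        f"Extracted structure:\n{structure}\n\n"
        f"Using the structure above, answer: {question}"
    )

if __name__ == "__main__":
    passage = "Alice manages Bob. Bob mentors Carol."
    stage1 = build_structure_prompt(passage)
    # In practice an LLM produces the structure from stage1; stubbed here.
    structure = "Alice -[manages]-> Bob\nBob -[mentors]-> Carol"
    stage2 = build_task_prompt(passage, structure, "Who mentors Carol?")
    print(stage2)
```

In a real pipeline the stage-1 prompt would be sent to the model, its structured output parsed, and the result fed into stage 2, so the final answer is conditioned on an explicit intermediate structure rather than raw text alone.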
Related papers
- PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning [55.78158607697319]
Large language models (LLMs) are evolving from conversational systems into strong reasoners for tasks such as Olympiad mathematics and competitive programming. We present PromptCoT 2.0, a scalable framework that replaces hand-crafted synthesis with an expectation-maximization loop. This produces problems that are both harder and more diverse than prior corpora.
arXiv Detail & Related papers (2025-09-24T08:46:29Z) - The Digital Sous Chef -- A Comparative Study on Fine-Tuning Language Models for Recipe Generation [2.497854684676663]
We present a comprehensive study contrasting a fine-tuned GPT-2 large (774M) model against the GPT-2 small (124M) model and traditional LSTM/RNN baselines on the 5-cuisine corpus from RecipeDB. Our key contribution is a targeted tokenization strategy that augments the vocabulary with 23 common fraction tokens and custom structural markers.
arXiv Detail & Related papers (2025-08-20T13:53:13Z) - Clarifying Before Reasoning: A Coq Prover with Structural Context [13.273599284897411]
We introduce a concept-level metric to evaluate task clarity and show that adding structured semantic context leads to a 1.85$\times$ improvement in clarity score. We evaluate this on 1,386 theorems randomly sampled from 15 standard Coq packages.
arXiv Detail & Related papers (2025-07-03T11:35:34Z) - Reasoning with Reinforced Functional Token Tuning [70.96651128307985]
We propose Reinforced Functional Token Tuning (RFTT) to empower Large Language Models (LLMs) with self-play learn-to-reason capabilities. RFTT embeds a rich set of learnable functional tokens directly into the model vocabulary, enabling chain-of-thought construction with diverse human-like reasoning behaviors.
arXiv Detail & Related papers (2025-02-19T02:59:42Z) - EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking [58.15568681219339]
We introduce EquiBench, a new benchmark for evaluating large language models (LLMs) via equivalence checking. This task directly tests a model's ability to reason about program semantics. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z) - StrucText-Eval: Evaluating Large Language Model's Reasoning Ability in Structure-Rich Text [29.03935605732864]
We introduce StrucText-Eval, a benchmark to evaluate how well large language models understand and reason through structured text.
We show that while open-source LLMs achieve a maximum accuracy of 74.9% on the standard dataset, their performance drops significantly to 45.8% on the harder dataset.
In contrast, human participants reach an accuracy of 92.6% on StrucText-Eval-Hard, highlighting LLMs' current limitations in handling intricate structural information.
arXiv Detail & Related papers (2024-06-15T12:48:00Z) - T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation [55.16845189272573]
T2I-CompBench++ is an enhanced benchmark for compositional text-to-image generation. It comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions.
arXiv Detail & Related papers (2023-07-12T17:59:42Z) - Text Classification via Large Language Models [63.1874290788797]
We introduce Clue And Reasoning Prompting (CARP) to address complex linguistic phenomena involved in text classification.
Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks.
More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups.
arXiv Detail & Related papers (2023-05-15T06:24:45Z) - Prompt-based Learning for Text Readability Assessment [0.4757470449749875]
We propose the novel adaptation of a pre-trained seq2seq model for readability assessment.
We prove that a seq2seq model can be adapted to discern which text is more difficult from two given texts (pairwise).
arXiv Detail & Related papers (2023-02-25T18:39:59Z) - Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z) - Conceptual Text Region Network: Cognition-Inspired Accurate Scene Text Detection [7.716899861923764]
We propose a human cognition-inspired framework, termed Conceptual Text Region Network (CTRNet).
CTRNet utilizes Conceptual Text Regions (CTRs), a class of cognition-based tools inheriting good mathematical properties, allowing for sophisticated label design.
CTRNet achieves state-of-the-art performance on benchmark CTW1500, Total-Text, MSRA-TD500, and ICDAR 2015 datasets, yielding performance gains of up to 2.0%.
arXiv Detail & Related papers (2021-03-16T16:28:33Z) - Structure-Tags Improve Text Classification for Scholarly Document Quality Prediction [4.4641025448898475]
We propose the use of HANs combined with structure-tags which mark the role of sentences in the document.
Adding tags to sentences, marking them as corresponding to title, abstract or main body text, yields improvements over the state-of-the-art for scholarly document quality prediction.
arXiv Detail & Related papers (2020-04-30T22:34:34Z) - AMR Parsing via Graph-Sequence Iterative Inference [62.85003739964878]
We propose a new end-to-end model that treats AMR parsing as a series of dual decisions on the input sequence and the incrementally constructed graph.
We show that the answers to these two questions are mutually causal.
We design a model based on iterative inference that helps achieve better answers in both perspectives, leading to greatly improved parsing accuracy.
arXiv Detail & Related papers (2020-04-12T09:15:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.