Related papers: When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation

URL: http://arxiv.org/abs/2512.17083v2
Date: Wed, 24 Dec 2025 18:05:57 GMT
Title: When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation
Authors: Michael H. Coen
Abstract summary: This paper introduces an evaluation framework that reports boundary density and segment alignment diagnostics (purity and coverage) alongside window-tolerant F1 (W-F1). By separating boundary scoring from boundary selection, we evaluate segmentation quality across density regimes rather than at a single operating point.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of work, evaluation practice remains dominated by strict boundary matching and F1-based metrics. Modern large language model (LLM) based conversational systems increasingly rely on segmentation to manage conversation history beyond fixed context windows. In such systems, unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation framework that reports boundary density and segment alignment diagnostics (purity and coverage) alongside window-tolerant F1 (W-F1). By separating boundary scoring from boundary selection, we evaluate segmentation quality across density regimes rather than at a single operating point. Cross-dataset evaluation shows that reported performance differences often reflect annotation granularity mismatch rather than boundary placement quality alone. We evaluate structurally distinct segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Boundary-based metrics are strongly coupled to boundary density: threshold sweeps produce larger W-F1 changes than switching between methods. These findings support viewing topic segmentation as a granularity selection problem rather than prediction of a single correct boundary set. This motivates separating boundary scoring from boundary selection for analyzing and tuning segmentation under varying annotation granularities.
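
The abstract names four quantities (boundary density, purity, coverage, and W-F1) and a threshold sweep, but does not define them here. The sketch below is one plausible reading, not the paper's implementation. It assumes a boundary is the index i of the gap between utterances i and i+1, that W-F1 matches boundaries one-to-one within a fixed window, and that purity and coverage follow the overlap-based definitions common in diarization evaluation. All function names and the example scores are hypothetical.

```python
from collections import Counter


def select_boundaries(scores, threshold):
    """Boundary selection: keep gap i if its score clears the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def boundary_density(boundaries, n_utterances):
    """Fraction of the n-1 gaps between utterances that are boundaries."""
    return len(boundaries) / (n_utterances - 1)


def window_f1(pred, ref, window=1):
    """F1 with greedy one-to-one matching at distance <= window (assumed)."""
    unmatched = sorted(ref)
    hits = 0
    for p in sorted(pred):
        for r in unmatched:
            if abs(p - r) <= window:
                unmatched.remove(r)  # each reference boundary is used once
                hits += 1
                break
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred), hits / len(ref)
    return 2 * precision * recall / (precision + recall)


def segment_ids(boundaries, n_utterances):
    """Label each utterance with the index of the segment it falls in."""
    cuts = set(boundaries)
    ids, current = [], 0
    for i in range(n_utterances):
        ids.append(current)
        if i in cuts:
            current += 1
    return ids


def purity_coverage(pred, ref, n_utterances):
    """Purity: how much of each predicted segment lies in one reference segment.
    Coverage: how much of each reference segment lies in one predicted segment."""
    overlap = Counter(zip(segment_ids(pred, n_utterances),
                          segment_ids(ref, n_utterances)))
    best_for_pred, best_for_ref = Counter(), Counter()
    for (p, r), count in overlap.items():
        best_for_pred[p] = max(best_for_pred[p], count)
        best_for_ref[r] = max(best_for_ref[r], count)
    return (sum(best_for_pred.values()) / n_utterances,
            sum(best_for_ref.values()) / n_utterances)


if __name__ == "__main__":
    n = 10                       # utterances in a toy dialogue
    reference = [2, 6]           # hypothetical gold boundaries
    # Hypothetical per-gap boundary scores from some scoring method.
    scores = [0.1, 0.3, 0.9, 0.2, 0.6, 0.1, 0.8, 0.4, 0.7]
    for threshold in (0.75, 0.5, 0.25):
        pred = select_boundaries(scores, threshold)
        purity, coverage = purity_coverage(pred, reference, n)
        print(threshold, boundary_density(pred, n),
              window_f1(pred, reference), purity, coverage)
```

The scores stay fixed while only the threshold changes, so any movement in W-F1 across the sweep comes from boundary density alone. That is the coupling the abstract describes. In this toy case, lowering the threshold over-segments the dialogue: purity stays high while coverage and W-F1 fall.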

Related papers

Codebook-Injected Dialogue Segmentation for Multi-Utterance Constructs Annotation: LLM-Assisted and Gold-Label-Free Evaluation [0.17240671897505613]
Dialogue Act (DA) annotation treats communicative or pedagogical intent as localized to individual utterances or turns. We propose codebook-injected segmentation, which conditions boundary decisions on downstream annotation criteria. We find DA-awareness produces segments that are internally more consistent than text-only baselines.
arXiv Detail & Related papers (2026-01-17T14:17:13Z)
ES-Mem: Event Segmentation-Based Memory for Long-Term Dialogue Agents [25.10969436399974]
ES-Mem is a framework that partitions long-term interactions into semantically coherent events with distinct boundaries. We show that ES-Mem yields consistent performance gains over baseline methods. The proposed event segmentation module exhibits robust applicability on dialogue segmentation datasets.
arXiv Detail & Related papers (2026-01-12T14:33:32Z)
LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance [54.683384204063934]
Large multi-modal models (LMMs) struggle with inaccurate segmentation and hallucinated comprehension. We propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation. LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
arXiv Detail & Related papers (2025-07-08T07:46:26Z)
Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts [64.93416171745693]
ThinkFirst is a training-free reasoning segmentation framework. Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process.
arXiv Detail & Related papers (2025-03-10T16:26:11Z)
SuperDialseg: A Large-scale Dataset for Supervised Dialogue Segmentation [55.82577086422923]
We provide a feasible definition of dialogue segmentation points with the help of document-grounded dialogues. We release a large-scale supervised dataset called SuperDialseg, containing 9,478 dialogues. We also provide a benchmark including 18 models across five categories for the dialogue segmentation task.
arXiv Detail & Related papers (2023-05-15T06:08:01Z)
Unsupervised Dialogue Topic Segmentation with Topic-aware Utterance Representation [51.22712675266523]
Dialogue Topic Segmentation (DTS) plays an essential role in a variety of dialogue modeling tasks. We propose a novel unsupervised DTS framework, which learns topic-aware utterance representations from unlabeled dialogue data.
arXiv Detail & Related papers (2023-05-04T11:35:23Z)
Contrastive Boundary Learning for Point Cloud Segmentation [81.7289734276872]
We propose a novel contrastive boundary learning framework for point cloud segmentation. We experimentally show that CBL consistently improves different baselines and assists them to achieve compelling performance on boundaries.
arXiv Detail & Related papers (2022-03-10T10:08:09Z)
FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows [63.116280145770006]
We propose segment act, an extension of dialog act from utterance level to segment level, and crowdsource a large-scale dataset for it. To utilize segment act flows, sequences of segment acts, for evaluation, we develop the first consensus-based dialogue evaluation framework, FlowEval.
arXiv Detail & Related papers (2022-02-14T11:37:20Z)
Boundary Guided Context Aggregation for Semantic Segmentation [23.709865471981313]
We exploit boundary as a significant guidance for context aggregation to promote the overall semantic understanding of an image. We conduct extensive experiments on the Cityscapes and ADE20K databases, and comparable results are achieved with the state-of-the-art methods.
arXiv Detail & Related papers (2021-10-27T17:04:38Z)
Improving Multi-Party Dialogue Discourse Parsing via Domain Integration [25.805553277418813]
Multi-party conversations are implicitly organized by semantic-level correlations across the interactive turns. Dialogue discourse analysis can be applied to predict the dependency structure and relations between the elementary discourse units. Existing corpora with dialogue discourse annotation are collected from specific domains with limited sample sizes.
arXiv Detail & Related papers (2021-10-09T09:36:22Z)
Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation [33.35220574193796]
We propose a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level, e.g., at the phoneme level. A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE. We show that our single model outperforms existing phoneme and word segmentation methods on TIMIT and Buckeye datasets.
arXiv Detail & Related papers (2021-06-03T23:12:05Z)
Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities. We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.