Topic Segmentation in the Wild: Towards Segmentation of Semi-structured
& Unstructured Chats
- URL: http://arxiv.org/abs/2211.14954v1
- Date: Sun, 27 Nov 2022 22:17:16 GMT
- Title: Topic Segmentation in the Wild: Towards Segmentation of Semi-structured
& Unstructured Chats
- Authors: Reshmi Ghosh, Harjeet Singh Kajal, Sharanya Kamath, Dhuri Shrivastava,
Samyadeep Basu, Soundararajan Srinivasan
- Abstract summary: We analyze the capabilities of state-of-the-art topic segmentation models on unstructured texts.
We find that pre-training on a large corpus of structured text does not help in transferability to unstructured texts.
Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin.
- Score: 2.9360071145551068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Breaking down a document or a conversation into multiple contiguous segments
based on its semantic structure is an important and challenging problem in NLP,
which can assist many downstream tasks. However, current works on topic
segmentation often focus on segmentation of structured texts. In this paper, we
comprehensively analyze the generalization capabilities of state-of-the-art
topic segmentation models on unstructured texts. We find that: (a) Current
strategies of pre-training on a large corpus of structured text such as
Wiki-727K do not help in transferability to unstructured texts. (b) Training
from scratch with only a relatively small-sized dataset of the target
unstructured domain improves the segmentation results by a significant margin.
Related papers
- WAS: Dataset and Methods for Artistic Text Segmentation [57.61335995536524]
This paper focuses on the more challenging task of artistic text segmentation and constructs a real artistic text segmentation dataset.
We propose a decoder with the layer-wise momentum query to prevent the model from ignoring stroke regions of special shapes.
We also propose a skeleton-assisted head to guide the model to focus on the global structure.
arXiv Detail & Related papers (2024-07-31T18:29:36Z) - Visual Prompt Selection for In-Context Learning Segmentation [77.15684360470152]
In this paper, we focus on rethinking and improving the example selection strategy.
We first demonstrate that ICL-based segmentation models are sensitive to different contexts.
Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation.
arXiv Detail & Related papers (2024-07-14T15:02:54Z) - From Text Segmentation to Smart Chaptering: A Novel Benchmark for
Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z) - Topic Segmentation of Semi-Structured and Unstructured Conversational
Datasets using Language Models [3.7908886926768344]
Current works on topic segmentation often focus on segmentation of structured texts.
We propose Focal Loss function as a robust alternative to Cross-Entropy and re-weighted Cross-Entropy loss function when segmenting unstructured and semi-structured chats.
arXiv Detail & Related papers (2023-10-26T03:37:51Z) - Topic-driven Distant Supervision Framework for Macro-level Discourse
Parsing [72.14449502499535]
The task of analyzing the internal rhetorical structure of texts is a challenging problem in natural language processing.
Despite the recent advances in neural models, the lack of large-scale, high-quality corpora for training remains a major obstacle.
Recent studies have attempted to overcome this limitation by using distant supervision.
arXiv Detail & Related papers (2023-05-23T07:13:51Z) - Uncovering the Potential of ChatGPT for Discourse Analysis in Dialogue:
An Empirical Study [51.079100495163736]
This paper systematically inspects ChatGPT's performance in two discourse analysis tasks: topic segmentation and discourse parsing.
ChatGPT demonstrates proficiency in identifying topic structures in general-domain conversations yet struggles considerably in specific-domain conversations.
Our deeper investigation indicates that ChatGPT can give more reasonable topic structures than human annotations but only linearly parses the hierarchical rhetorical structures.
arXiv Detail & Related papers (2023-05-15T07:14:41Z) - Betrayed by Captions: Joint Caption Grounding and Generation for Open
Vocabulary Instance Segmentation [80.48979302400868]
We focus on open vocabulary instance segmentation to expand a segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and captions in nouns.
We devise a joint textbfCaption Grounding and Generation (CGG) framework, which incorporates a novel grounding loss that only focuses on matching object to improve learning efficiency.
arXiv Detail & Related papers (2023-01-02T18:52:12Z) - Toward Unifying Text Segmentation and Long Document Summarization [31.084738269628748]
We study the role that section segmentation plays in extractive summarization of written and spoken documents.
Our approach learns robust sentence representations by performing summarization and segmentation simultaneously.
Our findings suggest that the model can not only achieve state-of-the-art performance on publicly available benchmarks, but demonstrate better cross-genre transferability.
arXiv Detail & Related papers (2022-10-28T22:07:10Z) - Topical Change Detection in Documents via Embeddings of Long Sequences [4.13878392637062]
We formulate the task of text segmentation as an independent supervised prediction task.
By fine-tuning on paragraphs of similar sections, we are able to show that learned features encode topic information.
Unlike previous approaches, which mostly operate on sentence-level, we consistently use a broader context.
arXiv Detail & Related papers (2020-12-07T12:09:37Z) - Two-Level Transformer and Auxiliary Coherence Modeling for Improved Text
Segmentation [9.416757363901295]
We introduce a novel supervised model for text segmentation with simple but explicit coherence modeling.
Our model -- a neural architecture consisting of two hierarchically connected Transformer networks -- is a multi-task learning model that couples the sentence-level segmentation objective with the coherence objective that differentiates correct sequences of sentences from corrupt ones.
arXiv Detail & Related papers (2020-01-03T17:06:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.