TreeSeg: Hierarchical Topic Segmentation of Large Transcripts
- URL: http://arxiv.org/abs/2407.12028v1
- Date: Fri, 28 Jun 2024 23:49:26 GMT
- Authors: Dimitrios C. Gklezakos, Timothy Misiak, Diamond Bishop
- Abstract summary: We present TreeSeg, an approach that combines off-the-shelf embedding models with divisive clustering to generate hierarchical, structured segmentations of transcripts in the form of binary trees.
We evaluate TreeSeg on the ICSI and AMI corpora, demonstrating that it outperforms all baselines.
Finally, we introduce TinyRec, a small-scale corpus of manually annotated transcripts, obtained from self-recorded video sessions.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: From organizing recorded videos and meetings into chapters, to breaking down large inputs in order to fit them into the context window of commoditized Large Language Models (LLMs), topic segmentation of large transcripts emerges as a task of increasing significance. Still, accurate segmentation presents many challenges, including (a) the noisy nature of the Automatic Speech Recognition (ASR) software typically used to obtain the transcripts, (b) the lack of diverse labeled data and (c) the difficulty in pinpointing the ground-truth number of segments. In this work we present TreeSeg, an approach that combines off-the-shelf embedding models with divisive clustering to generate hierarchical, structured segmentations of transcripts in the form of binary trees. Our approach is robust to noise and can handle large transcripts efficiently. We evaluate TreeSeg on the ICSI and AMI corpora, demonstrating that it outperforms all baselines. Finally, we introduce TinyRec, a small-scale corpus of manually annotated transcripts, obtained from self-recorded video sessions.
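The abstract's recipe of pairing utterance embeddings with divisive clustering can be illustrated with a short sketch. This is not the authors' implementation: the function names `best_split` and `treeseg`, the cosine-similarity-between-centroids split criterion, and the fixed recursion depth are all assumptions made for the example; in practice the embeddings would come from an off-the-shelf embedding model applied to transcript utterances.

```python
import numpy as np

def best_split(emb, lo, hi):
    """Find the boundary in (lo, hi) that best separates the utterance
    range [lo, hi) into two segments, scoring each candidate by the
    cosine similarity between the two segment centroids (lower
    similarity suggests a cleaner topical split)."""
    best_i, best_sim = None, np.inf
    for i in range(lo + 1, hi):
        left = emb[lo:i].mean(axis=0)
        right = emb[i:hi].mean(axis=0)
        sim = left @ right / (np.linalg.norm(left) * np.linalg.norm(right))
        if sim < best_sim:
            best_i, best_sim = i, sim
    return best_i

def treeseg(emb, lo=0, hi=None, depth=2):
    """Recursively split the utterance range [lo, hi) into a binary
    tree of segments: each node is (lo, hi) plus its two children,
    and leaves are plain (lo, hi) spans."""
    if hi is None:
        hi = len(emb)
    if depth == 0 or hi - lo < 2:
        return (lo, hi)  # leaf segment
    i = best_split(emb, lo, hi)
    return ((lo, hi), treeseg(emb, lo, i, depth - 1), treeseg(emb, i, hi, depth - 1))
```

On a toy transcript whose first three utterance embeddings point one way and whose last three point another, `treeseg(emb, depth=1)` returns the root span with children `(0, 3)` and `(3, 6)`, i.e. the single most dissimilar boundary. Because only a split point (not a cluster count) is chosen at each step, the tree sidesteps the ground-truth-number-of-segments problem the abstract mentions: any depth cut of the tree yields a flat segmentation.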
Related papers
- From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z)
- LISA: Reasoning Segmentation via Large Language Model [68.24075852136761]
We propose a new segmentation task -- reasoning segmentation.
The task is designed to output a segmentation mask given a complex and implicit query text.
We present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models.
arXiv Detail & Related papers (2023-08-01T17:50:17Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- SuperDialseg: A Large-scale Dataset for Supervised Dialogue Segmentation [55.82577086422923]
We provide a feasible definition of dialogue segmentation points with the help of document-grounded dialogues.
We release a large-scale supervised dataset called SuperDialseg, containing 9,478 dialogues.
We also provide a benchmark including 18 models across five categories for the dialogue segmentation task.
arXiv Detail & Related papers (2023-05-15T06:08:01Z)
- MUG: A General Meeting Understanding and Generation Benchmark [60.09540662936726]
We build the AliMeeting4MUG Corpus, which consists of 654 recorded Mandarin meeting sessions with diverse topic coverage.
In this paper, we provide a detailed introduction of this corpus, SLP tasks and evaluation methods, baseline systems and their performance.
arXiv Detail & Related papers (2023-03-24T11:52:25Z)
- Toward Unifying Text Segmentation and Long Document Summarization [31.084738269628748]
We study the role that section segmentation plays in extractive summarization of written and spoken documents.
Our approach learns robust sentence representations by performing summarization and segmentation simultaneously.
Our findings suggest that the model can not only achieve state-of-the-art performance on publicly available benchmarks, but demonstrate better cross-genre transferability.
arXiv Detail & Related papers (2022-10-28T22:07:10Z)
- Identifying Introductions in Podcast Episodes from Automatically Generated Transcripts
We build a novel dataset of complete transcriptions of over 400 podcast episodes.
These introductions contain information about the episodes' topics, hosts, and guests.
We train three Transformer models based on the pre-trained BERT and different augmentation strategies.
arXiv Detail & Related papers (2021-10-14T00:34:51Z)
- An End-to-End Document-Level Neural Discourse Parser Exploiting Multi-Granularity Representations [24.986030179701405]
We exploit robust representations derived from multiple levels of granularity across syntax and semantics.
We incorporate such representations in an end-to-end encoder-decoder neural architecture for more resourceful discourse processing.
arXiv Detail & Related papers (2020-12-21T08:01:04Z)
- Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)