Beyond Transcripts: A Renewed Perspective on Audio Chaptering
- URL: http://arxiv.org/abs/2602.08979v1
- Date: Mon, 09 Feb 2026 18:28:10 GMT
- Title: Beyond Transcripts: A Renewed Perspective on Audio Chaptering
- Authors: Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, Alexander Waibel
- Abstract summary: We show that a novel audio-only architecture (AudioSeg) outperforms text-based approaches for segmenting long-form audio into coherent sections. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and MLLMs remain limited by context length and weak instruction following.
- Score: 66.61445564139052
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Audio chaptering, the task of automatically segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison between text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs; (2) empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and MLLMs remain limited by context length and weak instruction following, yet MLLMs are promising on shorter audio.
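The abstract's third contribution contrasts transcript-dependent text-space protocols with transcript-invariant time-space protocols. As a minimal sketch of the time-space idea (not the paper's exact formalization), boundary quality can be scored by matching predicted chapter timestamps to reference timestamps within a tolerance window; the `tolerance_s` value and the greedy one-to-one matching below are illustrative assumptions:

```python
# Minimal sketch of a transcript-invariant, time-space boundary metric.
# Assumptions (not from the paper): greedy one-to-one matching and a fixed
# tolerance window `tolerance_s` around each reference boundary.

def boundary_f1(pred_s: list[float], ref_s: list[float], tolerance_s: float = 5.0):
    """Precision/recall/F1 over chapter boundaries given in seconds."""
    matched: set[int] = set()
    tp = 0
    for p in sorted(pred_s):
        # Closest still-unmatched reference boundary within the tolerance.
        best, best_dist = None, tolerance_s
        for i, r in enumerate(ref_s):
            if i not in matched and abs(p - r) <= best_dist:
                best, best_dist = i, abs(p - r)
        if best is not None:
            matched.add(best)
            tp += 1
    precision = tp / len(pred_s) if pred_s else 0.0
    recall = tp / len(ref_s) if ref_s else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One boundary matched within 5 s, one missed, one spurious prediction:
print(boundary_f1([10.0, 65.0], [12.0, 300.0]))  # -> (0.5, 0.5, 0.5)
```

Because everything is compared in seconds rather than in token positions, such a protocol is unaffected by ASR errors and tokenization, which is what makes transcript-free evaluation possible.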
Related papers
- Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features [17.9089265435157]
We propose a multi-modal approach that fine-tunes both a text encoder and a Siamese audio encoder, capturing acoustic cues around sentence boundaries. Experiments on a large-scale dataset of YouTube videos show substantial gains over text-only and multi-modal baselines. (A sketch of pause-style boundary features appears after this list.)
arXiv Detail & Related papers (2026-02-06T12:16:51Z)
- Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations [18.74784108693223]
Transformer-based speech language models (SLMs) have significantly improved neural speech recognition and understanding. The extent to which SLMs encode nuanced syntactic and conceptual features remains unclear. This study is the first to systematically evaluate the presence of contextual syntactic and semantic features across SLMs.
arXiv Detail & Related papers (2025-09-19T06:29:33Z)
- AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including two new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context [45.56363286769136]
We introduce Solla, a novel framework designed to understand speech-based questions and hear the acoustic context concurrently. Solla incorporates an audio tagging module to effectively identify and represent audio events, as well as an ASR-assisted prediction method to improve comprehension of spoken content. We propose a new benchmark dataset called SA-Eval, which includes three tasks: audio event classification, audio captioning, and audio question answering.
arXiv Detail & Related papers (2025-03-19T15:34:21Z)
- ADIFF: Explaining audio difference using natural language [31.963783032080993]
This paper comprehensively studies the task of explaining audio differences and then proposes a benchmark and baselines for the task. We present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets. We propose ADIFF, which introduces a cross-projection module, position captioning, and a three-step training process to enhance the model's ability to produce detailed explanations.
arXiv Detail & Related papers (2025-02-06T20:00:43Z)
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs. In our method, we utilize a pre-trained text-to-audio (TTA) diffusion network as the audio generation agent to work in tandem with GPT-4. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct a high-quality video-text dataset, named HowToStep, with multi-step descriptive supervision.
arXiv Detail & Related papers (2023-12-21T17:28:09Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Audio-text Retrieval in Context [24.38055340045366]
In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment.
We build our contextual audio-text retrieval system using pre-trained audio features and a descriptor-based aggregation method.
Our proposed system achieves significant improvements on bidirectional audio-text retrieval across all metrics, including recall, median rank, and mean rank (see the metric sketch at the end of this list).
arXiv Detail & Related papers (2022-03-25T13:41:17Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
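Two entries above converge on the same acoustic cue: the main paper reports that pauses provide the largest acoustic gains, and Reading Between the Waves captures acoustic cues around sentence boundaries (the sketch promised there). Below is a hypothetical illustration of how pause features could be derived from ASR word-level timestamps; the `Word` fields and the feature names are assumptions for illustration, not taken from either paper:

```python
# Hypothetical sketch of pause features around candidate sentence boundaries,
# derived from ASR word-level timestamps; the feature set and field names are
# illustrative assumptions, not taken from either paper.

from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start_s: float  # word onset in seconds
    end_s: float    # word offset in seconds

def pause_features(words: list[Word], boundary_idx: int) -> dict[str, float]:
    """Features for the gap between words[boundary_idx] and the next word."""
    gap = words[boundary_idx + 1].start_s - words[boundary_idx].end_s
    # Mean inter-word gap over the utterance, to normalize for speaker tempo.
    gaps = [words[i + 1].start_s - words[i].end_s for i in range(len(words) - 1)]
    mean_gap = sum(gaps) / len(gaps)
    return {
        "pause_s": gap,                            # raw pause duration
        "pause_ratio": gap / max(mean_gap, 1e-6),  # relative to local tempo
    }

words = [Word("chapter", 0.0, 0.4), Word("one", 0.5, 0.8), Word("next", 3.1, 3.5)]
print(pause_features(words, 1))  # long pause before "next" -> likely boundary
```

A segmentation model can consume such features alongside text or learned audio representations; the normalization by mean gap is one simple way to keep the cue comparable across slow and fast speakers.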
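Finally, for the retrieval metrics cited in Audio-text Retrieval in Context (the sketch referenced in that entry): recall@K and median rank are conventionally computed from a query-by-candidate similarity matrix. The matrix below and the diagonal-match convention are illustrative assumptions; the paper's exact setup may differ:

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray, ks=(1, 5, 10)):
    """Standard recall@K and median rank from a query-x-candidate similarity
    matrix where sim[i, i] is the matching pair (a common convention; the
    paper's exact setup may differ)."""
    order = np.argsort(-sim, axis=1)  # candidate indices, best match first
    # Rank (1-based) at which each query retrieves its ground-truth candidate.
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["median_rank"] = float(np.median(ranks))
    return metrics

sim = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.8, 0.4],
                [0.2, 0.7, 0.5]])   # row i = query i, column j = candidate j
print(retrieval_metrics(sim))       # query 2 ranks its match 2nd -> R@1 = 2/3
```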