Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features
- URL: http://arxiv.org/abs/2602.06647v1
- Date: Fri, 06 Feb 2026 12:16:51 GMT
- Title: Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features
- Authors: Steffen Freisinger, Philipp Seeberger, Tobias Bocklet, Korbinian Riedhammer
- Abstract summary: We propose a multi-modal approach that fine-tunes both a text encoder and a Siamese audio encoder, capturing acoustic cues around sentence boundaries. Experiments on a large-scale dataset of YouTube videos show substantial gains over text-only and multi-modal baselines.
- Score: 17.9089265435157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spoken content, such as online videos and podcasts, often spans multiple topics, which makes automatic topic segmentation essential for user navigation and downstream applications. However, current methods do not fully leverage acoustic features, leaving room for improvement. We propose a multi-modal approach that fine-tunes both a text encoder and a Siamese audio encoder, capturing acoustic cues around sentence boundaries. Experiments on a large-scale dataset of YouTube videos show substantial gains over text-only and multi-modal baselines. Our model also proves more resilient to ASR noise and outperforms a larger text-only baseline on three additional datasets in Portuguese, German, and English, underscoring the value of learned acoustic features for robust topic segmentation.
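The boundary-scoring idea behind such multi-modal segmentation can be sketched in miniature. The following is a hypothetical simplification, not the paper's actual model (which fine-tunes a text encoder and a Siamese audio encoder end-to-end): it fuses text-embedding dissimilarity with a single acoustic cue, inter-sentence pause duration, using an illustrative mixing weight `alpha`. All function names and the threshold are assumptions for illustration.

```python
# Hypothetical sketch: fuse a textual topic-shift signal with an acoustic
# cue (pause length) to score each sentence boundary as a topic break.

def cosine(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def boundary_scores(sent_embs, pauses, alpha=0.7):
    """sent_embs: one embedding per sentence (n vectors).
    pauses: inter-sentence pause durations in seconds (n - 1 values).
    Returns one fused score per boundary; higher = more likely topic break."""
    max_pause = max(pauses) or 1.0  # normalize pauses to [0, 1]
    scores = []
    for i in range(len(sent_embs) - 1):
        text_shift = 1.0 - cosine(sent_embs[i], sent_embs[i + 1])
        acoustic = pauses[i] / max_pause
        scores.append(alpha * text_shift + (1 - alpha) * acoustic)
    return scores

def segment(scores, threshold=0.5):
    """Indices of sentences that start a new topic segment."""
    return [i + 1 for i, s in enumerate(scores) if s >= threshold]
```

In the paper's setting, both signals would instead come from learned encoders trained jointly, but the fusion-at-the-boundary intuition is the same.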
Related papers
- Beyond Transcripts: A Renewed Perspective on Audio Chaptering [66.61445564139052]
We show that a novel audio-only architecture (AudioSeg) outperforms text-based approaches for segmenting long-form audio into coherent sections. Our experiments on YTSeg reveal that pauses provide the largest acoustic gains and that MLLMs remain limited by context length and weak instruction following.
arXiv Detail & Related papers (2026-02-09T18:28:10Z)
- Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation [65.7990140284317]
We focus on object grounding, i.e., localizing an object of interest in a visual scene based on verbal human instructions. To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods.
arXiv Detail & Related papers (2025-11-27T02:00:28Z)
- ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [47.14083940177122]
ThinkSound is a novel framework that enables stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent audio generation, interactive object-centric refinement, and targeted editing. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z)
- Discrete Audio Tokens: More Than a Survey! [137.3721175670642]
This paper presents a systematic review and benchmark of discrete audio tokenizers. It covers speech, music, and general audio domains. We propose a taxonomy of tokenization approaches based on encoder-decoder architecture, quantization techniques, training paradigm, streamability, and application domains.
arXiv Detail & Related papers (2025-06-12T01:35:43Z)
- Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model [11.62674351793]
We introduce a novel audio-based TTS model to adapt context features with multiple enhancements.
Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer.
Our proposed method outperforms baselines across various context TTS scenarios.
arXiv Detail & Related papers (2024-06-06T03:06:45Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination.
We then propose a parameter-efficient, inference-time faithful-decoding algorithm that enables smaller audio captioning models to match the performance of larger models trained with more data.
arXiv Detail & Related papers (2023-09-06T19:42:52Z) - On decoder-only architecture for speech-to-text and large language model
integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - Identifying Introductions in Podcast Episodes from Automatically
Generated Transcripts [0.0]
We build a novel dataset of complete transcriptions of over 400 podcast episodes.
Episode introductions contain information about the topics, hosts, and guests.
We train three Transformer models based on the pre-trained BERT and different augmentation strategies.
arXiv Detail & Related papers (2021-10-14T00:34:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.