Segment Length Matters: A Study of Segment Lengths on Audio Fingerprinting Performance
- URL: http://arxiv.org/abs/2601.17690v1
- Date: Sun, 25 Jan 2026 04:32:32 GMT
- Title: Segment Length Matters: A Study of Segment Lengths on Audio Fingerprinting Performance
- Authors: Ziling Gong, Yunyan Ouyang, Iram Kamdar, Melody Ma, Hongjie Chen, Franck Dernoncourt, Ryan A. Rossi, Nesreen K. Ahmed,
- Abstract summary: We study how segment length affects audio fingerprinting performance. Our results show that short segment lengths (0.5-second) generally achieve better performance. Our findings provide practical guidance for selecting segment duration in large-scale neural audio retrieval systems.
- Score: 65.82811567989506
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Audio fingerprinting provides an identifiable representation of acoustic signals, which can later be used in identification and retrieval systems. To obtain a discriminative representation, the input audio is usually segmented into shorter time intervals, allowing local acoustic features to be extracted and analyzed. Modern neural approaches typically operate on short, fixed-duration audio segments, yet the choice of segment duration is often made heuristically and rarely examined in depth. In this paper, we study how segment length affects audio fingerprinting performance. We extend an existing neural fingerprinting architecture to support various segment lengths and evaluate retrieval accuracy across different segment lengths and query durations. Our results show that short segment lengths (0.5-second) generally achieve better performance. Moreover, we evaluate the capacity of LLMs to recommend the best segment length; among the three LLMs studied, GPT-5-mini consistently gives the best suggestions across five considerations. Our findings provide practical guidance for selecting segment duration in large-scale neural audio retrieval systems.
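The segmentation step described in the abstract, slicing a waveform into short, fixed-duration (possibly overlapping) windows before feature extraction, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the 16 kHz sample rate and 0.25 s hop are assumed values, while the 0.5 s segment length reflects the duration the study found to perform well.

```python
import numpy as np

def segment_audio(waveform, sr=16000, seg_dur=0.5, hop_dur=0.25):
    """Slice a 1-D waveform into fixed-duration, overlapping segments.

    sr, seg_dur, and hop_dur are illustrative values: the paper studies
    several segment lengths and reports that ~0.5 s generally works best.
    Each returned row is one segment ready for fingerprint extraction.
    """
    seg_len = int(seg_dur * sr)
    hop_len = int(hop_dur * sr)
    segments = [
        waveform[start:start + seg_len]
        for start in range(0, len(waveform) - seg_len + 1, hop_len)
    ]
    return np.stack(segments) if segments else np.empty((0, seg_len))

# Example: 2 seconds of audio at 16 kHz -> overlapping 0.5 s segments
audio = np.random.randn(2 * 16000)
segs = segment_audio(audio)
print(segs.shape)  # (7, 8000): seven 0.5 s windows at a 0.25 s hop
```

Each segment would then be mapped to a fingerprint embedding; shorter segments yield finer temporal resolution at the cost of more database entries per track.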
Related papers
- Length-Adaptive Interest Network for Balancing Long and Short Sequence Modeling in CTR Prediction [50.094751096858204]
LAIN is a plug-and-play framework that incorporates sequence length as a conditioning signal to balance long- and short-sequence modeling. Our work offers a general, efficient, and deployable solution to mitigate length-induced bias in sequential recommendation.
arXiv Detail & Related papers (2026-01-27T03:14:20Z) - AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs [53.248502396225724]
AudioMarathon is a benchmark designed to evaluate both understanding and inference efficiency on long-form audio. We evaluate state-of-the-art LALMs and observe clear performance drops as audio length grows. The results show large gaps across current LALMs and highlight the need for better temporal reasoning.
arXiv Detail & Related papers (2025-10-08T17:50:16Z) - Forensic deepfake audio detection using segmental speech features [27.29588853432526]
This study explores the potential of using acoustic features of segmental speech sounds to detect deepfake audio. Certain segmental features commonly used in forensic voice comparison (FVC) are effective in identifying deepfakes, whereas some global features provide little value.
arXiv Detail & Related papers (2025-05-20T02:42:46Z) - A Flexible and Scalable Framework for Video Moment Search [51.47907684209207]
This paper introduces a flexible framework for retrieving a ranked list of moments from a collection of videos of any length to match a text query. Our framework, called Segment-Proposal-Ranking (SPR), simplifies the search process into three independent stages: segment retrieval, proposal generation, and moment refinement with re-ranking. Evaluations on the TVR-Ranking dataset demonstrate that our framework achieves state-of-the-art performance with significant reductions in computational cost and processing time.
arXiv Detail & Related papers (2025-01-09T08:54:19Z) - Real Time Multi Organ Classification on Computed Tomography Images [0.08192907805418582]
We demonstrate a method to obtain organ labels in real time by using a large context size with a sparse data sampling strategy. Although our method operates as an independent classifier at query locations, it can generate full segmentations by querying grid locations at any resolution.
arXiv Detail & Related papers (2024-04-29T14:17:52Z) - Temporal Segment Transformer for Action Segmentation [54.25103250496069]
We propose an attention-based approach, which we call the temporal segment transformer, for joint segment relation modeling and denoising.
The main idea is to denoise segment representations using attention between segment and frame representations, and also use inter-segment attention to capture temporal correlations between segments.
We show that this novel architecture achieves state-of-the-art accuracy on the popular 50Salads, GTEA and Breakfast benchmarks.
arXiv Detail & Related papers (2023-02-25T13:05:57Z) - Universal speaker recognition encoders for different speech segments duration [7.104489204959814]
A system trained simultaneously on pooled short and long speech segments does not give optimal verification results.
We describe our simple recipe for training universal speaker encoder for any type of selected neural network architecture.
arXiv Detail & Related papers (2022-10-28T16:06:00Z) - Play It Back: Iterative Attention for Audio Recognition [104.628661890361]
A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time.
We propose an end-to-end attention-based architecture that through selective repetition attends over the most discriminative sounds.
We show that our method can consistently achieve state-of-the-art performance across three audio-classification benchmarks.
arXiv Detail & Related papers (2022-10-20T15:03:22Z) - E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR [38.79441296832869]
We propose an end-to-end ASR model capable of predicting segment boundaries in a streaming fashion.
We demonstrate 8.5% relative WER improvement and 250 ms reduction in median end-of-segment latency compared to the VAD segmenter baseline on a state-of-the-art Conformer RNN-T model.
arXiv Detail & Related papers (2022-04-22T15:13:12Z) - Robust Feature Learning on Long-Duration Sounds for Acoustic Scene Classification [54.57150493905063]
Acoustic scene classification (ASC) aims to identify the type of scene (environment) in which a given audio signal is recorded.
We propose a robust feature learning (RFL) framework to train the CNN.
arXiv Detail & Related papers (2021-08-11T03:33:05Z) - Neural Sequence Segmentation as Determining the Leftmost Segments [25.378188980430256]
We propose a novel framework that incrementally segments natural language sentences at segment level.
For every step in segmentation, it recognizes the leftmost segment of the remaining sequence.
We have conducted extensive experiments on syntactic chunking and Chinese part-of-speech tagging across 3 datasets.
arXiv Detail & Related papers (2021-04-15T03:35:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.