Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate
- URL: http://arxiv.org/abs/2505.16845v1
- Date: Thu, 22 May 2025 16:10:01 GMT
- Title: Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate
- Authors: Hanglei Zhang, Yiwei Guo, Zhihan Li, Xiang Hao, Xie Chen, Kai Yu
- Abstract summary: We propose a Temporally Flexible Coding (TFC) technique, introducing variable frame rate (VFR) into neural speech codecs for the first time. TFC enables seamlessly tunable average frame rates and dynamically allocates frame rates based on temporal entropy. Experimental results show that a codec with TFC achieves optimal reconstruction quality with high flexibility, and maintains competitive performance even at lower frame rates.
- Score: 14.03590336780589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most neural speech codecs achieve bitrate adjustment through intra-frame mechanisms, such as codebook dropout, at a Constant Frame Rate (CFR). However, speech segments inherently have time-varying information density (e.g., silent intervals versus voiced regions). This property makes CFR not optimal in terms of bitrate and token sequence length, hindering efficiency in real-time applications. In this work, we propose a Temporally Flexible Coding (TFC) technique, introducing variable frame rate (VFR) into neural speech codecs for the first time. TFC enables seamlessly tunable average frame rates and dynamically allocates frame rates based on temporal entropy. Experimental results show that a codec with TFC achieves optimal reconstruction quality with high flexibility, and maintains competitive performance even at lower frame rates. Our approach is promising for the integration with other efforts to develop low-frame-rate neural speech codecs for more efficient downstream tasks.
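The entropy-driven allocation described in the abstract can be illustrated with a minimal sketch. All function names, the spectral-entropy criterion, and the threshold value here are illustrative assumptions, not the paper's actual gating mechanism: windows with low temporal entropy (e.g. silence) are assigned a low frame rate, while information-dense regions (e.g. voiced speech) keep a high one.

```python
import numpy as np

def spectral_entropy(frame: np.ndarray) -> float:
    """Shannon entropy (bits) of the normalized magnitude spectrum of one window."""
    mag = np.abs(np.fft.rfft(frame))
    p = mag / (mag.sum() + 1e-12)
    return float(-(p * np.log2(p + 1e-12)).sum())

def allocate_frame_rates(signal: np.ndarray, win: int = 320,
                         low_hz: float = 12.5, high_hz: float = 50.0,
                         thresh: float = 5.0) -> list:
    """Assign each analysis window a frame rate based on its entropy.

    Windows whose spectral entropy exceeds `thresh` (information-dense,
    e.g. voiced speech) get the high rate; the rest (e.g. silence) get
    the low rate. Returns one rate per window of `win` samples.
    """
    n = len(signal) // win
    rates = []
    for i in range(n):
        h = spectral_entropy(signal[i * win : (i + 1) * win])
        rates.append(high_hz if h > thresh else low_hz)
    return rates
```

In this toy version the average frame rate is tuned indirectly via `thresh`; the paper instead makes the average rate a seamlessly adjustable target.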
Related papers
- CodecFlow: Efficient Bandwidth Extension via Conditional Flow Matching in Neural Codec Latent Space [13.286622421661313]
Speech Bandwidth Extension improves clarity and intelligibility by restoring/inferring appropriate high-frequency content for low-bandwidth speech. Existing methods often rely on spectrogram or waveform modeling, which can incur higher computational cost and have limited high-frequency fidelity. We present CodecFlow, a neural-codec-based BWE framework that performs efficient speech reconstruction in a compact latent space.
arXiv Detail & Related papers (2026-03-02T16:03:46Z) - CoPE-VideoLM: Codec Primitives For Efficient Video Language Models [56.76440182038839]
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. Current methods use frame sampling, which can miss both macro-level events and micro-level details due to sparse temporal coverage. We propose to leverage video codec primitives, which encode video redundancy and sparsity without requiring expensive full-image encoding for most frames.
arXiv Detail & Related papers (2026-02-13T18:57:31Z) - Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization [27.32235541083431]
We introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer. DyCAST learns to associate tokens with character-level linguistic units during training. We also introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing the token rate.
arXiv Detail & Related papers (2026-01-30T16:58:40Z) - U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation [71.59514998928833]
U-Codec achieves high-fidelity reconstruction and fast speech generation at an extremely low frame rate of 5 Hz. We apply U-Codec to a large language model (LLM)-based auto-regressive TTS model.
arXiv Detail & Related papers (2025-10-19T05:09:20Z) - READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed video latent space via a VAE, significantly reducing the token count for generation. We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z) - Neural Video Compression with Context Modulation [9.875413481663742]
In this paper, we address the limitation by modulating the temporal context with the reference frame in two steps. We achieve an average 22.7% bitrate reduction over the advanced traditional video codec H.266/VVC, and offer an average 10.1% bitrate saving over the previous state-of-the-art NVC, DCVC-FM.
arXiv Detail & Related papers (2025-05-20T15:57:09Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Towards Practical Real-Time Neural Video Compression [60.390180067626396]
We introduce a practical real-time neural video codec (NVC) designed to deliver a high compression ratio, low latency, and broad versatility. Experiments show our proposed DCVC-RT achieves an impressive average encoding/decoding speed of 125.2/112.8 frames per second for 1080p video, while saving an average of 21% in bitrate compared to H.266/VTM.
arXiv Detail & Related papers (2025-02-28T06:32:23Z) - Boosting Neural Representations for Videos with a Conditional Decoder [28.073607937396552]
Implicit neural representations (INRs) have emerged as a promising approach for video storage and processing.
This paper introduces a universal boosting framework for current implicit video representation approaches.
arXiv Detail & Related papers (2024-02-28T08:32:19Z) - Dynamic Frame Interpolation in Wavelet Domain [57.25341639095404]
Video frame interpolation is an important low-level computer vision task, which can increase the frame rate for a more fluent visual experience.
Existing methods have achieved great success by employing advanced motion models and synthesis networks.
WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-07T06:41:15Z) - ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, have high computational bottlenecks, and cannot be used for real-time video enhancement.
In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams.
To evaluate our method, we curate two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computation, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z) - FFNeRV: Flow-Guided Frame-Wise Neural Representations for Videos [5.958701846880935]
We propose FFNeRV, a novel method for incorporating flow information into frame-wise representations to exploit the temporal redundancy across the frames in videos.
With model compression techniques, FFNeRV outperforms widely-used standard video codecs (H.264 and HEVC) and performs on par with state-of-the-art video compression algorithms.
arXiv Detail & Related papers (2022-12-23T12:51:42Z) - Latent-Domain Predictive Neural Speech Coding [22.65761249591267]
This paper introduces latent-domain predictive coding into the VQ-VAE framework.
We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner.
Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps.
arXiv Detail & Related papers (2022-07-18T03:18:08Z) - Decomposition, Compression, and Synthesis (DCS)-based Video Coding: A Neural Exploration via Resolution-Adaptive Learning [30.54722074562783]
We decompose the input video into spatial texture frames (STF) at its native spatial resolution, along with temporal motion frames (TMF).
Then, we compress them together using any popular video coder.
Finally, we synthesize the decoded STFs and TMFs for high-quality video reconstruction at the same resolution as the native input.
arXiv Detail & Related papers (2020-12-01T17:23:53Z) - Conditional Entropy Coding for Efficient Video Compression [82.35389813794372]
We propose a very simple and efficient video compression framework that only focuses on modeling the conditional entropy between frames.
We first show that a simple architecture modeling the entropy between the image latent codes is as competitive as other neural video compression works and video codecs.
We then propose a novel internal learning extension on top of this architecture that brings an additional 10% savings without trading off decoding speed.
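The idea of spending bits only on the conditional entropy between frames can be sketched numerically. This is a toy model with assumed names: the paper learns the conditional distribution with a network, whereas here the previous frame's latent is used directly as the predicted mean of a unit-scale Gaussian over unit-width quantization bins.

```python
import numpy as np
from math import erf

def _phi(x: np.ndarray) -> np.ndarray:
    """Standard normal CDF, applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))

def conditional_bits(z_t: np.ndarray, z_prev: np.ndarray) -> float:
    """Estimated bits to entropy-code integer-quantized latent z_t given z_prev.

    Toy conditional model: the previous latent serves as the predicted mean
    (a real codec predicts it with a learned network), unit scale, and each
    symbol's probability is the Gaussian mass of its unit-width bin.
    """
    residual = z_t - z_prev
    p = np.clip(_phi(residual + 0.5) - _phi(residual - 0.5), 1e-12, 1.0)
    return float(-np.log2(p).sum())
```

Latents well predicted by the previous frame cost few bits, while poorly predicted ones cost many, so the total rate tracks the conditional entropy rather than the entropy of each frame in isolation.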
arXiv Detail & Related papers (2020-08-20T20:01:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content and is not responsible for any consequences arising from its use.