VineetVC: Adaptive Video Conferencing Under Severe Bandwidth Constraints Using Audio-Driven Talking-Head Reconstruction
- URL: http://arxiv.org/abs/2602.12758v1
- Date: Fri, 13 Feb 2026 09:37:10 GMT
- Title: VineetVC: Adaptive Video Conferencing Under Severe Bandwidth Constraints Using Audio-Driven Talking-Head Reconstruction
- Authors: Vineet Kumar Rakesh, Soumya Mazumdar, Tapas Samanta, Hemendra Kumar Pandey, Amitabha Das, Sarbajit Pal
- Abstract summary: Severe bandwidth depletion within consumer and constrained networks can undermine the stability of real-time video conferencing. This work presents an adaptive conferencing system that integrates WebRTC media delivery with a supplementary audio-driven talking-head reconstruction pathway.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Intense bandwidth depletion within consumer and constrained networks has the potential to undermine the stability of real-time video conferencing: encoder rate management becomes saturated, packet loss escalates, frame rates deteriorate, and end-to-end latency significantly increases. This work delineates an adaptive conferencing system that integrates WebRTC media delivery with a supplementary audio-driven talking-head reconstruction pathway and telemetry-driven mode regulation. The system consists of a WebSocket signaling service, an optional SFU for multi-party transmission, a browser client capable of real-time WebRTC statistics extraction and CSV telemetry export, and an AI REST service that processes a reference face image and recorded audio to produce a synthesized MP4; the browser can substitute its outbound camera track with the synthesized stream with a median bandwidth of 32.80 kbps. The solution incorporates a bandwidth-mode switching strategy and a client-side mode-state logger.
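The abstract describes a bandwidth-mode switching strategy that substitutes the outbound camera track with the synthesized talking-head stream when bandwidth collapses. The paper does not publish its switching thresholds, so the sketch below uses hypothetical values (`ENTER_AI_BELOW_KBPS`, `EXIT_AI_ABOVE_KBPS`) purely to illustrate the hysteresis logic such a mode regulator would need:

```javascript
// Sketch of a client-side bandwidth-mode switcher. Thresholds are
// hypothetical; the paper only reports the synthesized stream's
// median bitrate of 32.80 kbps.

const AI_MODE_KBPS = 32.8;        // median synthesized-stream bitrate (from the abstract)
const ENTER_AI_BELOW_KBPS = 150;  // assumed downswitch threshold
const EXIT_AI_ABOVE_KBPS = 300;   // assumed upswitch threshold (hysteresis gap avoids flapping)

function nextMode(currentMode, availableKbps) {
  // Drop to AI reconstruction when estimated bandwidth collapses;
  // return to camera video only once clear headroom is restored.
  if (currentMode === "camera" && availableKbps < ENTER_AI_BELOW_KBPS) return "ai";
  if (currentMode === "ai" && availableKbps > EXIT_AI_ABOVE_KBPS) return "camera";
  return currentMode;
}
```

In a real client, the bandwidth estimate would come from periodic `RTCPeerConnection.getStats()` sampling (the telemetry pathway the abstract mentions), and the actual swap would use `RTCRtpSender.replaceTrack()` on the outbound video sender to splice in the synthesized stream.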
Related papers
- Voxtral Realtime [134.66962524291424]
Voxtral Realtime is a streaming automatic speech recognition model. It matches offline transcription quality at sub-second latency. We release the model weights under the Apache 2.0 license.
arXiv Detail & Related papers (2026-02-11T19:17:10Z)
- qAttCNN - Self Attention Mechanism for Video QoE Prediction in Encrypted Traffic [2.4851388650413866]
Video conferencing applications (VCAs) and instant messaging applications (IMAs) like WhatsApp and Telegram increasingly support video conferencing as a core feature. End-to-end encryption, commonly used by modern VCAs and IMAs, prevents ISPs from accessing the original media stream. We propose the QoE Attention Convolutional Neural Network (qAttCNN) to infer two no-reference QoE metrics, viz. BRISQUE and frames per second (FPS). We evaluate qAttCNN on a custom dataset collected from WhatsApp video calls and compare it against existing QoE models.
arXiv Detail & Related papers (2026-01-11T11:08:40Z)
- Context Video Semantic Transmission with Variable Length and Rate Coding over MIMO Channels [49.624608869195065]
We propose the context video semantic transmission (CVST) framework for wireless video transmission. We learn a context-channel correlation map to explicitly formulate the relationships between feature groups and multiple-input multiple-output (MIMO) subchannels. We demonstrate substantial performance gains over various standardized separated coding methods and recent wireless video semantic communication approaches.
arXiv Detail & Related papers (2025-12-23T10:48:43Z)
- Large Speech Model Enabled Semantic Communication [58.027223937172955]
We present a Large Speech Model enabled Semantic Communication (LargeSC) system. We exploit the rich semantic knowledge embedded in large models and enable adaptive transmission over lossy channels. The system supports bandwidths ranging from 550 bps to 2.06 kbps and outperforms conventional baselines in speech quality under high packet-loss rates.
arXiv Detail & Related papers (2025-12-04T11:58:08Z)
- FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution [68.77813885751308]
State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. We propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data. Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames.
arXiv Detail & Related papers (2025-06-13T07:59:52Z)
- Semantic-Aware Adaptive Video Streaming Using Latent Diffusion Models for Wireless Networks [12.180483357502293]
This paper proposes a novel framework for real-time adaptive-bitrate video streaming by integrating Latent Diffusion Models (LDMs) within the FF techniques. The proposed approach leverages LDMs to compress I-frames into a latent space, offering significant storage and semantic transmission savings. This work opens new possibilities for scalable real-time video streaming in 5G and future post-5G networks.
arXiv Detail & Related papers (2025-02-08T21:14:28Z)
- VideoQA-SC: Adaptive Semantic Communication for Video Question Answering [21.0279034601774]
We propose an end-to-end SC system, named VideoQA-SC, for video question answering tasks. Our goal is to accomplish VideoQA tasks directly based on video semantics over noisy or fading wireless channels. Our results show the great potential of SC system design for video applications.
arXiv Detail & Related papers (2024-05-17T06:11:10Z)
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- DeepWiVe: Deep-Learning-Aided Wireless Video Transmission [0.0]
We present DeepWiVe, the first-ever end-to-end joint source-channel coding (JSCC) video transmission scheme.
We use deep neural networks (DNNs) to map video signals to channel symbols, combining video compression, channel coding, and modulation steps into a single neural transform.
Our results show that DeepWiVe can overcome the cliff effect, which is prevalent in conventional separation-based digital communication schemes.
arXiv Detail & Related papers (2021-11-25T11:34:24Z)
- A Deep Learning Approach for Low-Latency Packet Loss Concealment of Audio Signals in Networked Music Performance Applications [66.56753488329096]
Networked Music Performance (NMP) is envisioned as a potential game changer among Internet applications.
This article describes a technique for predicting lost packet content in real-time using a deep learning approach.
arXiv Detail & Related papers (2020-07-14T15:51:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.