VineetVC: Adaptive Video Conferencing Under Severe Bandwidth Constraints Using Audio-Driven Talking-Head Reconstruction
- URL: http://arxiv.org/abs/2602.12758v1
- Date: Fri, 13 Feb 2026 09:37:10 GMT
- Title: VineetVC: Adaptive Video Conferencing Under Severe Bandwidth Constraints Using Audio-Driven Talking-Head Reconstruction
- Authors: Vineet Kumar Rakesh, Soumya Mazumdar, Tapas Samanta, Hemendra Kumar Pandey, Amitabha Das, Sarbajit Pal
- Abstract summary: Severe bandwidth depletion within consumer and constrained networks can undermine the stability of real-time video conferencing. This work presents an adaptive conferencing system that integrates WebRTC media delivery with a supplementary audio-driven talking-head reconstruction pathway.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Intense bandwidth depletion within consumer and constrained networks has the potential to undermine the stability of real-time video conferencing: encoder rate management becomes saturated, packet loss escalates, frame rates deteriorate, and end-to-end latency significantly increases. This work delineates an adaptive conferencing system that integrates WebRTC media delivery with a supplementary audio-driven talking-head reconstruction pathway and telemetry-driven mode regulation. The system consists of a WebSocket signaling service, an optional SFU for multi-party transmission, a browser client capable of real-time WebRTC statistics extraction and CSV telemetry export, and an AI REST service that processes a reference face image and recorded audio to produce a synthesized MP4; the browser can substitute its outbound camera track with the synthesized stream with a median bandwidth of 32.80 kbps. The solution incorporates a bandwidth-mode switching strategy and a client-side mode-state logger.
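The abstract describes a bandwidth-mode switching strategy that substitutes the outbound camera track with the synthesized talking-head stream when bandwidth collapses. The paper does not publish its switching thresholds, so the sketch below uses hypothetical values (`ENTER_AI_BELOW_KBPS`, `EXIT_AI_ABOVE_KBPS`) purely to illustrate the hysteresis logic such a mode regulator would need:

```javascript
// Sketch of a client-side bandwidth-mode switcher. Thresholds are
// hypothetical; the paper only reports the synthesized stream's
// median bitrate of 32.80 kbps.

const AI_MODE_KBPS = 32.8;        // median synthesized-stream bitrate (from the abstract)
const ENTER_AI_BELOW_KBPS = 150;  // assumed downswitch threshold
const EXIT_AI_ABOVE_KBPS = 300;   // assumed upswitch threshold (hysteresis gap avoids flapping)

function nextMode(currentMode, availableKbps) {
  // Drop to AI reconstruction when estimated bandwidth collapses;
  // return to camera video only once clear headroom is restored.
  if (currentMode === "camera" && availableKbps < ENTER_AI_BELOW_KBPS) return "ai";
  if (currentMode === "ai" && availableKbps > EXIT_AI_ABOVE_KBPS) return "camera";
  return currentMode;
}
```

In a real client, the bandwidth estimate would come from periodic `RTCPeerConnection.getStats()` sampling (the telemetry pathway the abstract mentions), and the actual swap would use `RTCRtpSender.replaceTrack()` on the outbound video sender to splice in the synthesized stream.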
Related papers
- Voxtral Realtime [134.66962524291424]
Voxtral Realtime is a streaming automatic speech recognition model. It matches offline transcription quality at sub-second latency. We release the model weights under the Apache 2.0 license.
arXiv Detail & Related papers (2026-02-11T19:17:10Z)
- qAttCNN - Self Attention Mechanism for Video QoE Prediction in Encrypted Traffic [2.4851388650413866]
Video conferencing applications (VCAs) and instant messaging applications (IMAs) like WhatsApp and Telegram increasingly support video conferencing as a core feature. End-to-end encryption, commonly used by modern VCAs and IMAs, prevents ISPs from accessing the original media stream. We propose the QoE Attention Convolutional Neural Network (qAttCNN) to infer two no-reference QoE metrics, viz. BRISQUE and frames per second (FPS). We evaluate qAttCNN on a custom dataset collected from WhatsApp video calls and compare it against existing QoE models.
arXiv Detail & Related papers (2026-01-11T11:08:40Z)
- Context Video Semantic Transmission with Variable Length and Rate Coding over MIMO Channels [49.624608869195065]
We propose the context video semantic transmission (CVST) framework for wireless video transmission. We learn a context-channel correlation map to explicitly formulate the relationships between feature groups and multiple-input multiple-output (MIMO) subchannels. We demonstrate substantial performance gains over various standardized separated coding methods and recent wireless video semantic communication approaches.
arXiv Detail & Related papers (2025-12-23T10:48:43Z)
- Large Speech Model Enabled Semantic Communication [58.027223937172955]
We present a Large Speech Model enabled Semantic Communication (LargeSC) system. We exploit the rich semantic knowledge embedded in large models and enable adaptive transmission over lossy channels. The system supports bandwidths ranging from 550 bps to 2.06 kbps and outperforms conventional baselines in speech quality under high packet-loss rates.
arXiv Detail & Related papers (2025-12-04T11:58:08Z)
- FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution [68.77813885751308]
State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. We propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data. Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames.
arXiv Detail & Related papers (2025-06-13T07:59:52Z)
- Semantic-Aware Adaptive Video Streaming Using Latent Diffusion Models for Wireless Networks [12.180483357502293]
This paper proposes a novel framework for real-time adaptive-bitrate video streaming by integrating Latent Diffusion Models (LDMs) within the FF techniques. The proposed approach leverages LDMs to compress I-frames into a latent space, offering significant storage and semantic transmission savings. This work opens new possibilities for scalable real-time video streaming in 5G and future post-5G networks.
arXiv Detail & Related papers (2025-02-08T21:14:28Z)
- VideoQA-SC: Adaptive Semantic Communication for Video Question Answering [21.0279034601774]
We propose an end-to-end SC system, named VideoQA-SC, for video question answering tasks. Our goal is to accomplish VideoQA tasks directly based on video semantics over noisy or fading wireless channels. Our results show the great potential of SC system design for video applications.
arXiv Detail & Related papers (2024-05-17T06:11:10Z)
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- DeepWiVe: Deep-Learning-Aided Wireless Video Transmission [0.0]
We present DeepWiVe, the first-ever end-to-end joint source-channel coding (JSCC) video transmission scheme.
We use deep neural networks (DNNs) to map video signals to channel symbols, combining video compression, channel coding, and modulation steps into a single neural transform.
Our results show that DeepWiVe can overcome the cliff effect, which is prevalent in conventional separation-based digital communication schemes.
arXiv Detail & Related papers (2021-11-25T11:34:24Z)
- A Deep Learning Approach for Low-Latency Packet Loss Concealment of Audio Signals in Networked Music Performance Applications [66.56753488329096]
Networked Music Performance (NMP) is envisioned as a potential game changer among Internet applications.
This article describes a technique for predicting lost packet content in real-time using a deep learning approach.
arXiv Detail & Related papers (2020-07-14T15:51:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.