Related papers: VideoQA-SC: Adaptive Semantic Communication for Video Question Answering

VideoQA-SC: Adaptive Semantic Communication for Video Question Answering

URL: http://arxiv.org/abs/2406.18538v1
Date: Fri, 17 May 2024 06:11:10 GMT
Title: VideoQA-SC: Adaptive Semantic Communication for Video Question Answering
Authors: Jiangyuan Guo, Wei Chen, Yuxuan Sun, Jialong Xu, Bo Ai,
Abstract summary: We propose an end-to-end SC system for video question answering tasks called VideoQA-SC. Our goal is to accomplish VideoQA tasks directly based on video semantics over noisy or fading wireless channels. Our results show the great potential of task-oriented SC system design for video applications.
Score: 21.0279034601774
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although semantic communication (SC) has shown its potential in efficiently transmitting multi-modal data such as text, speeches and images, SC for videos has focused primarily on pixel-level reconstruction. However, these SC systems may be suboptimal for downstream intelligent tasks. Moreover, SC systems without pixel-level video reconstruction present advantages by achieving higher bandwidth efficiency and real-time performance of various intelligent tasks. The difficulty in such system design lies in the extraction of task-related compact semantic representations and their accurate delivery over noisy channels. In this paper, we propose an end-to-end SC system for video question answering (VideoQA) tasks called VideoQA-SC. Our goal is to accomplish VideoQA tasks directly based on video semantics over noisy or fading wireless channels, bypassing the need for video reconstruction at the receiver. To this end, we develop a spatiotemporal semantic encoder for effective video semantic extraction, and a learning-based bandwidth-adaptive deep joint source-channel coding (DJSCC) scheme for efficient and robust video semantic transmission. Experiments demonstrate that VideoQA-SC outperforms traditional and advanced DJSCC-based SC systems that rely on video reconstruction at the receiver under a wide range of channel conditions and bandwidth constraints. In particular, when the signal-to-noise ratio is low, VideoQA-SC can improve the answer accuracy by 5.17% while saving almost 99.5% of the bandwidth at the same time, compared with the advanced DJSCC-based SC system. Our results show the great potential of task-oriented SC system design for video applications.

Related papers

FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution [68.77813885751308]
State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information.<n>We propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data.<n>Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames.
arXiv Detail & Related papers (2025-06-13T07:59:52Z)
WVSC: Wireless Video Semantic Communication with Multi-frame Compensation [56.63352157833874]
Existing wireless video transmission schemes directly conduct video coding in pixel level. We propose a wireless video semantic communication framework, abbreviated as WVSC, which integrates the idea of semantic communication into wireless video transmission scenarios.
arXiv Detail & Related papers (2025-03-27T06:27:15Z)
Generative Video Semantic Communication via Multimodal Semantic Fusion with Large Model [55.71885688565501]
We propose a scalable generative video semantic communication framework that extracts and transmits semantic information to achieve high-quality video reconstruction. Specifically, at the transmitter, description and other condition signals are extracted from the source video, functioning as text and structural semantics, respectively. At the receiver, the diffusion-based GenAI large models are utilized to fuse the semantics of the multiple modalities for reconstructing the video.
arXiv Detail & Related papers (2025-02-19T15:59:07Z)
RL-RC-DoT: A Block-level RL agent for Task-Aware Video Compression [68.31184784672227]
In modern applications such as autonomous driving, an overwhelming majority of videos serve as input for AI systems performing tasks. It is therefore useful to optimize the encoder for a downstream task instead of for image quality. Here, we address this challenge by controlling the Quantization Parameters (QPs) at the macro-block level to optimize the downstream task.
arXiv Detail & Related papers (2025-01-21T15:36:08Z)
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z)
Object-Attribute-Relation Representation based Video Semantic Communication [35.87160453583808]
We introduce the use of object-attribute-relation (OAR) as a semantic framework for videos to facilitate low bit-rate coding. We utilize OAR sequences for both low bit-rate representation and generative video reconstruction. Our experiments on traffic surveillance video datasets assess the effectiveness of our approach in terms of video transmission performance.
arXiv Detail & Related papers (2024-06-15T02:19:31Z)
Attention-based UNet enabled Lightweight Image Semantic Communication System over Internet of Things [4.62215026195301]
We study the problem of the lightweight image semantic communication system that is deployed on Internet of Things (IoT) devices. We propose an attention-based UNet enabled lightweight image semantic communication (LSSC) system, which achieves low computational complexity and small model size.
arXiv Detail & Related papers (2024-01-14T16:46:50Z)
Cross-layer scheme for low latency multiple description video streaming over Vehicular Ad-hoc NETworks (VANETs) [2.2124180701409233]
HEVC standard is very promising for real-time video streaming. New state-of-the-art video coding (HEVC) standard is very promising for real-time video streaming. We propose an original cross-layer system in order to enhance received video quality in vehicular communications.
arXiv Detail & Related papers (2023-11-05T14:34:58Z)
Streaming Video Model [90.24390609039335]
We propose to unify video understanding tasks into one streaming video architecture, referred to as Streaming Vision Transformer (S-ViT) S-ViT first produces frame-level features with a memory-enabled temporally-aware spatial encoder to serve frame-based video tasks. The efficiency and efficacy of S-ViT is demonstrated by the state-of-the-art accuracy in the sequence-based action recognition.
arXiv Detail & Related papers (2023-03-30T08:51:49Z)
Wireless Deep Video Semantic Transmission [14.071114007641313]
We propose a new class of high-efficiency deep joint source-channel coding methods to achieve end-to-end video transmission over wireless channels. Our framework is collected under the name deep video semantic transmission (DVST)
arXiv Detail & Related papers (2022-05-26T03:26:43Z)
Wider or Deeper Neural Network Architecture for Acoustic Scene Classification with Mismatched Recording Devices [59.86658316440461]
We present a robust and low complexity system for Acoustic Scene Classification (ASC) We first construct an ASC baseline system in which a novel inception-residual-based network architecture is proposed to deal with the mismatched recording device issue. To further improve the performance but still satisfy the low complexity model, we apply two techniques: ensemble of multiple spectrograms and channel reduction.
arXiv Detail & Related papers (2022-03-23T10:27:41Z)
A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate on designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information. We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF) The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs) The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved. We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z)
DeepWiVe: Deep-Learning-Aided Wireless Video Transmission [0.0]
We present DeepWiVe, the first-ever end-to-end joint source-channel coding (JSCC) video transmission scheme. We use deep neural networks (DNNs) to map video signals to channel symbols, combining video compression, channel coding, and modulation steps into a single neural transform. Our results show that DeepWiVe can overcome the cliff-effect, which is prevalent in conventional separation-based digital communication schemes.
arXiv Detail & Related papers (2021-11-25T11:34:24Z)
An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond Feature and Signal [99.49099501559652]
Video Coding for Machine (VCM) aims to bridge the gap between visual feature compression and classical video coding. We employ a conditional deep generation network to reconstruct video frames with the guidance of learned motion pattern. By learning to extract sparse motion pattern via a predictive model, the network elegantly leverages the feature representation to generate the appearance of to-be-coded frames.
arXiv Detail & Related papers (2020-01-09T14:18:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.