Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding
- URL: http://arxiv.org/abs/2603.02470v1
- Date: Mon, 02 Mar 2026 23:36:38 GMT
- Title: Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding
- Authors: Jingxuan Men, Mahdi Boloursaz Mashhadi, Ning Wang, Yi Ma, Mike Nilsson, Rahim Tafazolli
- Abstract summary: Token Communication (TokenCom) is a new paradigm, motivated by the recent success of Large AI Models (LAMs) and Multimodal Large Language Models (MLLMs). We propose a novel Video TokenCom framework for textual intent-guided multi-rate video communication.
- Score: 24.169863403324314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Token Communication (TokenCom) is a new paradigm, motivated by the recent success of Large AI Models (LAMs) and Multimodal Large Language Models (MLLMs), where tokens serve as unified units of communication and computation, enabling efficient semantic- and goal-oriented information exchange in future wireless networks. In this paper, we propose a novel Video TokenCom framework for textual intent-guided multi-rate video communication with Unequal Error Protection (UEP)-based source-channel coding adaptation. The proposed framework integrates user-intended textual descriptions with discrete video tokenization and unequal error protection to enhance semantic fidelity under strict bandwidth constraints. First, discrete video tokens are extracted through a pretrained video tokenizer, while text-conditioned vision-language modeling and optical-flow propagation are jointly used to identify tokens that correspond to user-intended semantics across space and time. Next, we introduce a semantic-aware multi-rate bit-allocation strategy, in which tokens highly related to the user intent are encoded at full codebook precision, whereas non-intended tokens are represented through reduced-precision differential encoding, enabling rate savings while preserving semantic quality. Finally, a source and channel coding adaptation scheme is developed to adapt bit allocation and channel coding to varying resources and link conditions. Experiments on various video datasets demonstrate that the proposed framework outperforms both conventional and semantic communication baselines in perceptual and semantic quality over a wide SNR range.
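The semantic-aware multi-rate bit allocation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the relevance scores, codebook sizes, and threshold are assumed values, and a simple smaller-codebook re-quantization stands in for the paper's differential encoding of non-intended tokens.

```python
import numpy as np

def allocate_token_bits(relevance, full_codebook_size=1024,
                        reduced_codebook_size=64, threshold=0.5):
    """Assign per-token bit budgets from intent-relevance scores.

    Tokens scored at or above `threshold` keep full codebook precision;
    the rest get a coarser budget (here: a smaller codebook, standing in
    for the paper's reduced-precision differential encoding).
    All names and numbers are illustrative, not from the paper.
    """
    relevance = np.asarray(relevance, dtype=float)
    full_bits = int(np.log2(full_codebook_size))        # 10 bits/token
    reduced_bits = int(np.log2(reduced_codebook_size))  # 6 bits/token
    bits = np.where(relevance >= threshold, full_bits, reduced_bits)
    return bits, int(bits.sum())

# Example: 6 tokens, 2 judged intent-relevant.
bits, total = allocate_token_bits([0.9, 0.1, 0.2, 0.8, 0.3, 0.05])
# 2 tokens * 10 bits + 4 tokens * 6 bits = 44 bits,
# versus 60 bits at uniform full precision.
```

A UEP layer would then apply a stronger channel code to the full-precision tokens than to the reduced-precision ones, matching protection to semantic importance.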
Related papers
- Decoupling Vision and Language: Codebook Anchored Visual Adaptation [20.393987361723724]
Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates. We introduce CRAFT, a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space.
arXiv Detail & Related papers (2026-02-23T02:39:26Z)
- Wireless TokenCom: RL-Based Tokenizer Agreement for Multi-User Wireless Token Communications [59.84545048095092]
Token Communications (TokenCom) has recently emerged as an effective new paradigm, where tokens are the unified units of communications and computations. We investigate a multi-user downlink wireless TokenCom scenario, where the base station transmits to multiple users.
arXiv Detail & Related papers (2026-02-12T19:00:33Z)
- Context-Aware Iterative Token Detection and Masked Transmission for Wireless Token Communication [20.850802765685145]
We propose a context-aware token communication framework that uses a shared contextual probability model between the transmitter (Tx) and receiver (Rx). We introduce a context-aware masking strategy that skips transmission of highly predictable tokens to reduce the transmission rate.
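The masking idea can be sketched in a few lines: tokens the shared context model predicts confidently are not sent, and the receiver reinserts its own model's top prediction at the masked positions. The function names and the 0.9 threshold are illustrative assumptions, not taken from the paper.

```python
def mask_predictable_tokens(token_ids, ctx_probs, p_skip=0.9):
    """Transmitter side: send only tokens whose shared-context
    probability is below p_skip. Returns the sent tokens and the
    skip mask (True = skipped), which both ends can derive from
    the shared model."""
    mask = [p >= p_skip for p in ctx_probs]
    sent = [t for t, m in zip(token_ids, mask) if not m]
    return sent, mask

def reconstruct(sent, mask, ctx_top1):
    """Receiver side: merge received tokens with the context model's
    top-1 predictions (ctx_top1) at masked positions."""
    it = iter(sent)
    return [ctx_top1[i] if m else next(it) for i, m in enumerate(mask)]
```

In this sketch, only 2 of 4 tokens would be transmitted when two positions are highly predictable, halving the token rate at the cost of relying on the shared model being correct at skipped positions.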
arXiv Detail & Related papers (2026-01-25T10:10:51Z)
- Context Video Semantic Transmission with Variable Length and Rate Coding over MIMO Channels [49.624608869195065]
We propose the context video semantic transmission (CVST) framework for wireless video transmission. We learn a context-channel correlation map to explicitly formulate the relationships between feature groups and multiple-input multiple-output (MIMO) subchannels. We demonstrate substantial performance gains over various standardized separated coding methods and recent wireless video semantic communication approaches.
arXiv Detail & Related papers (2025-12-23T10:48:43Z)
- Joint Semantic-Channel Coding and Modulation for Token Communications [37.814311208185906]
We consider the problem of token communication, studying how to transmit tokens efficiently and reliably. We propose a joint semantic-channel coding and modulation scheme for the token encoder, mapping tokens to standard digital constellation points. The proposed method outperforms both joint semantic-channel coding and traditional separate coding.
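As a baseline for the token-to-constellation mapping described above, discrete token indices can simply be bit-unpacked and mapped to a standard Gray-coded QPSK symbol stream; the paper's learned joint mapping would replace this fixed rule. The bit width and modulation order here are illustrative assumptions.

```python
import numpy as np

def tokens_to_qpsk(token_ids, bits_per_token=10):
    """Map discrete token indices to Gray-coded QPSK symbols.
    A fixed-rule stand-in for a learned token-to-constellation
    mapping; bits_per_token is an assumed codebook width."""
    # Unpack each token id into its fixed-width binary representation.
    bits = []
    for t in token_ids:
        bits.extend((t >> k) & 1 for k in reversed(range(bits_per_token)))
    bits = np.array(bits).reshape(-1, 2)  # 2 bits per QPSK symbol
    # Each bit independently selects the sign of I or Q (Gray coding).
    iq = (1 - 2 * bits) / np.sqrt(2)      # unit-energy symbols
    return iq[:, 0] + 1j * iq[:, 1]
```

For example, a 10-bit token maps to 5 QPSK symbols; a learned joint scheme would instead optimize the symbol placement end-to-end against the channel.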
arXiv Detail & Related papers (2025-11-19T18:56:32Z)
- SecoustiCodec: Cross-Modal Aligned Streaming Single-Codebook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing methods face several challenges in semantic encoding. We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec.
arXiv Detail & Related papers (2025-08-04T19:22:14Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Adaptive Semantic Token Selection for AI-native Goal-oriented Communications [11.92172357956248]
We propose a novel design for AI-native goal-oriented communications.
We exploit transformer neural networks under dynamic inference constraints on bandwidth and computation.
We show that our model improves over state-of-the-art token selection mechanisms.
arXiv Detail & Related papers (2024-04-25T13:49:50Z)
- A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.