UniMIC: Token-Based Multimodal Interactive Coding for Human-AI Collaboration
- URL: http://arxiv.org/abs/2509.22570v1
- Date: Fri, 26 Sep 2025 16:46:12 GMT
- Title: UniMIC: Token-Based Multimodal Interactive Coding for Human-AI Collaboration
- Authors: Qi Mao, Tinghan Yang, Jiahao Li, Bin Li, Libiao Jin, Yan Lu,
- Abstract summary: We propose UniMIC, a token-based Multimodal Interactive Coding framework that bridges edge devices and cloud AI agents.<n>UniMIC employs compact tokenized representations as the communication medium, enabling efficient low-bitrate transmission.<n>Experiments on text-to-image generation, text-guided inpainting, outpainting, and visual question answering show that UniMIC achieves substantial savings.
- Score: 21.633481068383194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid progress of Large Multimodal Models (LMMs) and cloud-based AI agents is transforming human-AI collaboration into bidirectional, multimodal interaction. However, existing codecs remain optimized for unimodal, one-way communication, resulting in repeated degradation under conventional compress-transmit-reconstruct pipelines. To address this limitation, we propose UniMIC, a Unified token-based Multimodal Interactive Coding framework that bridges edge devices and cloud AI agents. Instead of transmitting raw pixels or plain text, UniMIC employs compact tokenized representations as the communication medium, enabling efficient low-bitrate transmission while maintaining compatibility with LMMs. To further enhance compression, lightweight Transformer-based entropy models with scenario-specific designs-generic, masked, and text-conditioned-effectively minimize inter-token redundancy. Extensive experiments on text-to-image generation, text-guided inpainting, outpainting, and visual question answering show that UniMIC achieves substantial bitrate savings and remains robust even at ultra-low bitrates (<0.05bpp), without compromising downstream task performance. These results establish UniMIC as a practical and forward-looking paradigm for next-generation multimodal interactive communication.
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs.<n>MPCO adaptively balances the importance of different paradigm representations and guides the global optimisation.<n>Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
arXiv Detail & Related papers (2026-03-05T06:01:26Z) - NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms.<n>Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks.<n>To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z) - Token Communication in the Era of Large Models: An Information Bottleneck-Based Approach [55.861432910722186]
UniToCom is a unified token communication paradigm that treats tokens as the fundamental units for both processing and wireless transmission.<n>We propose a generative information bottleneck (GenIB) principle, which facilitates the learning of tokens that preserve essential information.<n>We employ a causal Transformer-based multimodal large language model (MLLM) at the receiver to unify the processing of both discrete and continuous tokens.
arXiv Detail & Related papers (2025-07-02T14:03:01Z) - Task-Oriented Multimodal Token Transmission in Resource-Constrained Multiuser Networks [19.42660454288912]
We propose a task-oriented multimodal token transmission scheme for efficient multimodal information fusion and utilization.<n>To improve inter-modal consistency and task-relevant token transmission, we design a two-stage training algotithm.<n>We also formulate a weighted-sum optimization problem over latency and inference performance.
arXiv Detail & Related papers (2025-05-06T14:17:05Z) - InfoMAE: Pair-Efficient Cross-Modal Alignment for Multimodal Time-Series Sensing Signals [9.648001493025204]
InfoMAE is a cross-modal alignment framework that tackles the challenge of multimodal pair efficiency under the SSL setting.<n>It enhances downstream multimodal tasks by over 60% with significantly improved multimodal pairing efficiency.<n>It also improves unimodal task accuracy by an average of 22%.
arXiv Detail & Related papers (2025-04-13T20:03:29Z) - Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference [49.77734021302196]
We propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework.<n>To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features.<n>Results show that TOFC achieves up to 52% reduction in data transmission overhead and 63% reduction in system latency.
arXiv Detail & Related papers (2025-03-17T08:37:22Z) - OminiControl2: Efficient Conditioning for Diffusion Transformers [68.3243031301164]
We present OminiControl2, an efficient framework that achieves efficient image-conditional image generation.<n>OminiControl2 introduces two key innovations: (1) a dynamic compression strategy that streamlines conditional inputs by preserving only the most semantically relevant tokens during generation, and (2) a conditional feature reuse mechanism that computes condition token features only once and reuses them across denoising steps.
arXiv Detail & Related papers (2025-03-11T10:50:14Z) - Large Body Language Models [1.9797215742507548]
We introduce Large Body Language Models (LBLMs) and present LBLM-AVA, a novel LBLM architecture that combines a Transformer-XL large language model with a parallelized diffusion model to generate human-like gestures from multimodal inputs (text, audio, and video)
LBLM-AVA achieves state-of-the-art performance in generating lifelike and contextually appropriate gestures with a 30% reduction in Freche's Gesture Distance (FGD) and a 25% improvement in Freche's Inception Distance compared to existing approaches.
arXiv Detail & Related papers (2024-10-21T21:48:24Z) - Multimodal generative semantic communication based on latent diffusion model [13.035207938169844]
This paper introduces a multimodal generative semantic communication framework named mm-GESCO.
The framework ingests streams of visible and infrared modal image data, generates fused semantic segmentation maps, and transmits them.
At the receiving end, the framework can reconstruct the original multimodal images based on the semantic maps.
arXiv Detail & Related papers (2024-08-10T06:23:41Z) - Large AI Model Empowered Multimodal Semantic Communications [48.73159237649128]
We propose a Large AI Model-based Multimodal SC (LAMMSC) framework.
We first present the Conditional-based Multimodal Alignment (MMA) that enables the transformation between multimodal and unimodal data.
Then, a personalized LLM-based Knowledge Base (LKB) is proposed, which allows users to perform personalized semantic extraction or recovery.
Finally, we apply the Generative adversarial network-based channel Estimation (CGE) for estimating the wireless channel state information.
arXiv Detail & Related papers (2023-09-03T19:24:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.