LG-VQ: Language-Guided Codebook Learning
- URL: http://arxiv.org/abs/2405.14206v1
- Date: Thu, 23 May 2024 06:04:40 GMT
- Title: LG-VQ: Language-Guided Codebook Learning
- Authors: Guotao Liang, Baoquan Zhang, Yaowei Wang, Xutao Li, Yunming Ye, Huaibin Wang, Chuyao Luo, Kola Ye, Linfeng Luo
- Abstract summary: Vector quantization (VQ) is a key technique in high-resolution and high-fidelity image synthesis.
We propose a novel language-guided codebook learning framework, called LG-VQ.
Our method achieves superior performance on reconstruction and various multi-modal downstream tasks.
- Score: 36.422599206253324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vector quantization (VQ) is a key technique in high-resolution and high-fidelity image synthesis, which aims to learn a codebook to encode an image with a sequence of discrete codes and then generate an image in an auto-regressive manner. Although existing methods have shown superior performance, most methods prefer to learn a single-modal codebook (e.g., image), resulting in suboptimal performance when the codebook is applied to multi-modal downstream tasks (e.g., text-to-image, image captioning) due to the existence of modal gaps. In this paper, we propose a novel language-guided codebook learning framework, called LG-VQ, which aims to learn a codebook that can be aligned with text to improve the performance of multi-modal downstream tasks. Specifically, we first introduce pre-trained text semantics as prior knowledge, then design two novel alignment modules (i.e., a Semantic Alignment Module and a Relationship Alignment Module) to transfer such prior knowledge into codes and achieve codebook-text alignment. In particular, our LG-VQ method is model-agnostic and can be easily integrated into existing VQ models. Experimental results show that our method achieves superior performance on reconstruction and various multi-modal downstream tasks.
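The central idea, learning a visual codebook whose codes are additionally pulled toward frozen pre-trained text embeddings, can be illustrated with a minimal sketch. The `VectorQuantizer`, the cosine-based `semantic_align_loss`, and the linear projection into the text space below are illustrative assumptions, not the paper's actual Semantic or Relationship Alignment Modules.

```python
# Minimal sketch of language-guided codebook alignment (illustrative; not the
# authors' exact LG-VQ implementation). Text embeddings are assumed to come
# from a frozen pre-trained text encoder applied to the image's caption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                              # z: (B, N, dim) encoder features
        w = self.codebook.weight
        d = (z.pow(2).sum(-1, keepdim=True)            # squared distance to every code
             - 2 * z @ w.t()
             + w.pow(2).sum(-1))
        idx = d.argmin(dim=-1)                         # nearest-code indices
        z_q = self.codebook(idx)                       # quantized features
        # standard VQ losses: codebook loss + commitment loss
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                   # straight-through estimator
        return z_q, idx, vq_loss

def semantic_align_loss(code_vectors, text_embeddings, proj):
    """Pull the codes used by an image toward its frozen caption embedding.
    `proj` maps code space to the text-embedding space (an assumption here)."""
    pooled = proj(code_vectors.mean(dim=1))            # (B, text_dim)
    return 1.0 - F.cosine_similarity(pooled, text_embeddings, dim=-1).mean()

# usage sketch
vq = VectorQuantizer()
proj = nn.Linear(256, 512)            # 512 = assumed dimension of the frozen text encoder
z = torch.randn(4, 16, 256)           # stand-in for encoder features of 4 images
text_emb = torch.randn(4, 512)        # stand-in for frozen caption embeddings
z_q, idx, vq_loss = vq(z)
codes = vq.codebook(idx)              # re-embed indices so alignment gradients reach the codebook
loss = vq_loss + semantic_align_loss(codes, text_emb, proj)
loss.backward()
```

In practice the alignment term would simply be added to the usual reconstruction and VQ losses of whatever base VQ model it is plugged into, which is what makes this kind of objective model-agnostic.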
Related papers
- Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling [15.132926378740882]
We propose a novel codebook transfer framework with part-of-speech, called VQCT, which aims to transfer a well-trained codebook from pretrained language models to VQIM.
Experimental results on four datasets show that our VQCT method achieves superior VQIM performance over previous state-of-the-art methods.
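As a rough, hedged sketch of the codebook-transfer idea, the snippet below derives visual codes from frozen pre-trained word embeddings through a small learnable adapter; the class name and the plain linear mapping are assumptions for illustration, not VQCT's actual part-of-speech-based design.

```python
# Illustrative sketch: deriving a visual codebook from frozen pretrained word
# embeddings via a learnable projection (an assumption; not VQCT's exact design).
import torch
import torch.nn as nn

class TransferredCodebook(nn.Module):
    def __init__(self, word_embeddings: torch.Tensor, code_dim: int = 256):
        super().__init__()
        # word_embeddings: (num_words, word_dim), e.g. taken from a pretrained LM
        self.register_buffer("word_embeddings", word_embeddings)    # kept frozen
        self.proj = nn.Linear(word_embeddings.size(1), code_dim)    # learnable adapter

    def forward(self) -> torch.Tensor:
        # Codes are generated from the frozen word embeddings, so they inherit
        # the semantic structure of the language model's embedding space.
        return self.proj(self.word_embeddings)                      # (num_words, code_dim)

# usage sketch with random stand-ins for real pretrained embeddings
fake_word_emb = torch.randn(5000, 300)
codebook = TransferredCodebook(fake_word_emb, code_dim=256)
codes = codebook()                    # (5000, 256) visual codes derived from language
```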
arXiv Detail & Related papers (2024-03-15T07:24:13Z)
- UniCode: Learning a Unified Codebook for Multimodal Large Language Models [33.48624855154342]
We propose UniCode, a novel approach within the domain of multimodal large language models (MLLMs).
UniCode learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals.
Despite using significantly fewer parameters and less data during training, UniCode demonstrates promising capabilities in visual reconstruction and generation.
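A minimal sketch of the unified-codebook idea follows: features from different modalities are projected to a shared dimension and quantized against one codebook. The projection layers and sizes are assumed for illustration and are not UniCode's actual architecture or training recipe.

```python
# Minimal sketch of a single shared codebook quantizing features from more than
# one modality (illustrative; not UniCode's actual design).
import torch
import torch.nn as nn

class SharedQuantizer(nn.Module):
    def __init__(self, num_codes=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def quantize(self, feats):            # feats: (B, N, dim), any modality
        w = self.codebook.weight
        d = (feats.pow(2).sum(-1, keepdim=True)
             - 2 * feats @ w.t()
             + w.pow(2).sum(-1))          # squared distances to every code
        idx = d.argmin(dim=-1)            # discrete token ids shared across modalities
        return idx, self.codebook(idx)

quantizer = SharedQuantizer()
img_proj = nn.Linear(512, 256)            # assumed modality-specific projections
txt_proj = nn.Linear(768, 256)

img_tokens, _ = quantizer.quantize(img_proj(torch.randn(2, 64, 512)))  # fake vision features
txt_tokens, _ = quantizer.quantize(txt_proj(torch.randn(2, 16, 768)))  # fake text features
```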
arXiv Detail & Related papers (2024-03-14T03:29:58Z)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
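The unified auto-regressive objective can be sketched as next-token prediction over a single vocabulary that mixes text tokens and visual codes. The tiny transformer and the vocabulary split below are assumptions for illustration only, not VL-GPT's model.

```python
# Sketch of a unified auto-regressive objective over a mixed image/text token
# sequence (illustrative; this tiny model is not VL-GPT's architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1000 + 512        # assumed: 1000 text tokens + 512 visual codes in one vocabulary

class TinyARModel(nn.Module):
    def __init__(self, vocab=VOCAB, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                                  # tokens: (B, T) mixed ids
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=causal)        # causal self-attention
        return self.head(h)                                     # (B, T, vocab)

model = TinyARModel()
seq = torch.randint(0, VOCAB, (2, 32))        # fake interleaved text+image token ids
logits = model(seq[:, :-1])                   # predict each next token
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
```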
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- Reinforcement Learning from Diffusion Feedback: Q* for Image Search [2.5835347022640254]
We present two models for image generation using model-agnostic learning.
RLDF is a singular approach for visual imitation through prior-preserving reward function guidance.
It generates high-quality images over varied domains showcasing class-consistency and strong visual diversity.
arXiv Detail & Related papers (2023-11-27T09:20:12Z)
- Tackling VQA with Pretrained Foundation Models without Further Training [0.0]
Large language models (LLMs) have achieved state-of-the-art results in many natural language processing tasks.
With the capability of these LLMs, researchers have looked into how to adopt them for use with Visual Question Answering (VQA).
In this paper, we explore a method of combining pretrained LLMs and other foundation models without further training to solve the VQA problem.
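One common way to compose frozen models for VQA is sketched below: an image-captioning model describes the image, and an LLM answers the question from that description. The `caption_model` and `language_model` callables are hypothetical stand-ins, and the paper's exact composition may differ.

```python
# Hedged sketch of a training-free VQA pipeline built from frozen models.
# `caption_model` and `language_model` are hypothetical stand-ins, not real APIs.
from typing import Callable

def vqa_without_training(image,
                         question: str,
                         caption_model: Callable[[object], str],
                         language_model: Callable[[str], str]) -> str:
    caption = caption_model(image)                    # frozen captioner, no fine-tuning
    prompt = (f"Image description: {caption}\n"
              f"Question: {question}\n"
              f"Answer briefly:")
    return language_model(prompt)                     # frozen LLM produces the answer

# usage with trivial stand-ins
answer = vqa_without_training(
    image=None,
    question="What color is the car?",
    caption_model=lambda img: "A red car parked next to a tree.",
    language_model=lambda prompt: "Red.",
)
print(answer)
```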
arXiv Detail & Related papers (2023-09-27T08:35:24Z)
- Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework, which consists of a Masked Quantization VAE (MQ-VAE) and a Stackformer, to relieve the model from modeling redundancy.
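The masked-quantization idea can be sketched as scoring image regions and quantizing only the top-scoring ones. The simple linear scorer and fixed keep ratio below are simplifications of MQ-VAE's learned adaptive mask module, not its actual design.

```python
# Rough sketch of masked quantization: score regions, quantize only the top
# scorers, drop the rest (illustrative; not MQ-VAE's exact mechanism).
import torch
import torch.nn as nn

class MaskedQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=256, keep_ratio=0.5):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.scorer = nn.Linear(dim, 1)        # learned importance score per region
        self.keep_ratio = keep_ratio

    def forward(self, z):                      # z: (B, N, dim) region features
        B, N, D = z.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.scorer(z).squeeze(-1)    # (B, N)
        keep = scores.topk(k, dim=1).indices   # indices of the most informative regions
        z_keep = torch.gather(z, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        w = self.codebook.weight
        d = (z_keep.pow(2).sum(-1, keepdim=True) - 2 * z_keep @ w.t() + w.pow(2).sum(-1))
        return d.argmin(dim=-1), keep          # codes for kept regions + their positions

mq = MaskedQuantizer()
codes, kept = mq(torch.randn(2, 64, 256))      # only 32 of 64 regions get quantized
```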
arXiv Detail & Related papers (2023-05-23T02:15:53Z)
- Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework whose first stage, Dynamic-Quantization VAE (DQ-VAE), encodes image regions into variable-length codes based on their information densities for accurate representation.
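The variable-length intuition (regions carrying more information get more codes) can be sketched with a crude proxy for information density; the variance heuristic and the 1-versus-4 code budget below are illustrative assumptions, not DQ-VAE's actual mechanism.

```python
# Crude sketch of variable-length coding by information density (illustrative only).
import torch

def codes_per_region(region_feats: torch.Tensor, threshold: float = 1.0):
    """region_feats: (N, P, D) -- N regions, each with P patch features.
    Returns how many codes each region gets: 4 if 'dense', else 1."""
    density = region_feats.var(dim=(1, 2))               # crude information-density proxy
    return torch.where(density > threshold, torch.tensor(4), torch.tensor(1))

regions = torch.randn(8, 16, 64)                          # 8 regions of fake features
budget = codes_per_region(regions)
print(budget.tolist(), "-> total codes:", int(budget.sum()))
```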
arXiv Detail & Related papers (2023-05-19T14:56:05Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning maps the image and text into a common embedding space to learn text-image matching.
Instance-level optimization preserves identity during manipulation.
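The visual-linguistic similarity component can be sketched as a standard contrastive matching loss in a common embedding space; the projection sizes and the InfoNCE-style objective below are generic assumptions rather than TediGAN's exact formulation.

```python
# Small sketch of visual-linguistic similarity learning: project image and text
# features into one embedding space and train matching pairs to align
# (a generic contrastive formulation, not TediGAN's exact objective).
import torch
import torch.nn as nn
import torch.nn.functional as F

img_proj = nn.Linear(512, 256)            # image features -> common space (assumed sizes)
txt_proj = nn.Linear(768, 256)            # text features  -> common space

def matching_loss(img_feats, txt_feats, temperature=0.07):
    zi = F.normalize(img_proj(img_feats), dim=-1)
    zt = F.normalize(txt_proj(txt_feats), dim=-1)
    logits = zi @ zt.t() / temperature                  # (B, B) pairwise similarities
    targets = torch.arange(zi.size(0))                  # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = matching_loss(torch.randn(8, 512), torch.randn(8, 768))
loss.backward()
```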
arXiv Detail & Related papers (2020-12-06T16:20:19Z)