SOM-VQ: Topology-Aware Tokenization for Interactive Generative Models
- URL: http://arxiv.org/abs/2602.21133v1
- Date: Tue, 24 Feb 2026 17:29:04 GMT
- Title: SOM-VQ: Topology-Aware Tokenization for Interactive Generative Models
- Authors: Alessandro Londei, Denise Lanzieri, Matteo Benati,
- Abstract summary: We introduce SOM-VQ, a tokenization method that combines vector quantization with Self-Organizing Maps to learn discrete codebooks. SOM-VQ produces more learnable token sequences while providing an explicit geometry in code space. We focus on human motion generation, a domain where kinematic structure, smooth temporal continuity, and interactive use cases make topology-aware control especially natural.
- Score: 41.99844472131922
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vector-quantized representations enable powerful discrete generative models but lack semantic structure in token space, limiting interpretable human control. We introduce SOM-VQ, a tokenization method that combines vector quantization with Self-Organizing Maps to learn discrete codebooks with explicit low-dimensional topology. Unlike standard VQ-VAE, SOM-VQ uses topology-aware updates that preserve neighborhood structure: nearby tokens on a learned grid correspond to semantically similar states, enabling direct geometric manipulation of the latent space. We demonstrate that SOM-VQ produces more learnable token sequences in the evaluated domains while providing an explicit, navigable geometry in code space. Critically, the topological organization enables intuitive human-in-the-loop control: users can steer generation by manipulating distances in token space, achieving semantic alignment without frame-level constraints. We focus on human motion generation, a domain where kinematic structure, smooth temporal continuity, and interactive use cases (choreography, rehabilitation, HCI) make topology-aware control especially natural, and demonstrate controlled divergence and convergence from reference sequences through simple grid-based sampling. SOM-VQ provides a general framework for interpretable discrete representations applicable to music, gesture, and other interactive generative domains.
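The topology-aware update described in the abstract can be illustrated with a minimal NumPy sketch of a Kohonen-style codebook step: after the usual VQ nearest-neighbor assignment, a Gaussian kernel over grid distance pulls the winning code and its grid neighbors toward the same encoder output, which is what makes adjacent grid cells hold similar code vectors. This is a generic SOM-style sketch under assumed shapes and hyperparameters (the function name, lr, and sigma are illustrative), not the authors' implementation.

```python
import numpy as np

def som_codebook_update(codebook, grid_pos, z, lr=0.1, sigma=1.0):
    """One Kohonen-style topology-aware update.

    codebook : (K, D) code vectors
    grid_pos : (K, 2) fixed 2D grid coordinate of each code
    z        : (D,) encoder output to quantize
    Returns the winning code index; codebook is updated in place.
    """
    # Standard VQ step: nearest code in feature space.
    winner = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    # SOM step: a Gaussian kernel over *grid* distance (not feature
    # distance) pulls the winner and its grid neighbors toward z,
    # which is what induces the explicit topology in code space.
    grid_d2 = np.sum((grid_pos - grid_pos[winner]) ** 2, axis=1)
    h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
    codebook += lr * h[:, None] * (z - codebook)
    return winner

# Tiny usage example: a 4x4 grid of 8-dimensional codes.
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(4), np.arange(4))
grid_pos = np.stack([gx.ravel(), gy.ravel()], axis=1).astype(float)  # (16, 2)
codebook = rng.normal(size=(16, 8))
z = rng.normal(size=8)
before = codebook.copy()
winner = som_codebook_update(codebook, grid_pos, z)
```

After the update the winning code (and, more weakly, its grid neighbors) has moved toward z, so repeated updates arrange similar states at nearby grid positions.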
Related papers
- Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought [55.65577137924979]
We propose a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space. Experiments on three benchmarks demonstrate that NV-CoT significantly improves localization precision and final answer accuracy.
arXiv Detail & Related papers (2026-02-27T12:04:07Z) - Inverting Self-Organizing Maps: A Unified Activation-Based Framework [39.146761527401424]
We show that the activation pattern of a SOM can be inverted to recover the exact input under mild geometric conditions. We introduce the Manifold-Aware Unified SOM Inversion and Control (MUSIC) update rule. We validate the approach on synthetic Gaussian mixtures and on the MNIST and Faces in the Wild datasets.
arXiv Detail & Related papers (2026-01-20T11:02:54Z) - Neuro-Symbolic Spatial Reasoning in Segmentation [27.7231614319754]
Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. We introduce neuro-symbolic (NeSy) spatial reasoning in OVSS. This is the first attempt to explore NeSy spatial reasoning in OVSS.
arXiv Detail & Related papers (2025-10-17T17:35:34Z) - Cross-Layer Discrete Concept Discovery for Interpreting Language Models [13.842670153893977]
Cross-layer VQ-VAE is a framework that uses vector quantization to map representations across layers. Our approach uniquely combines top-k temperature-based sampling during quantization with EMA codebook updates.
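The EMA codebook update mentioned in this entry is the standard exponential-moving-average rule from EMA-VQ; a minimal single-batch NumPy sketch follows (the function name, initialization, and gamma are illustrative assumptions, and the top-k temperature-based sampling step is omitted):

```python
import numpy as np

def ema_codebook_update(codebook, cluster_size, embed_sum, z_batch,
                        gamma=0.99, eps=1e-5):
    """Standard EMA-VQ codebook update for one batch.

    codebook     : (K, D) code vectors, updated in place
    cluster_size : (K,)   EMA count of assignments per code
    embed_sum    : (K, D) EMA sum of encoder outputs per code
    z_batch      : (N, D) encoder outputs
    """
    # Hard-assign each encoder output to its nearest code.
    d = ((z_batch[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    assign = np.eye(codebook.shape[0])[np.argmin(d, axis=1)]         # one-hot
    # Exponential moving averages of counts and summed inputs.
    cluster_size[:] = gamma * cluster_size + (1 - gamma) * assign.sum(0)
    embed_sum[:] = gamma * embed_sum + (1 - gamma) * (assign.T @ z_batch)
    # Each code becomes the EMA mean of the inputs assigned to it;
    # eps keeps unused codes numerically stable.
    codebook[:] = embed_sum / (cluster_size[:, None] + eps)
    return codebook

# Tiny usage example: 4 codes in 2 dimensions, one batch of 32 inputs.
rng = np.random.default_rng(1)
codebook = rng.normal(size=(4, 2))
cluster_size = np.ones(4)
embed_sum = codebook.copy()
z_batch = rng.normal(size=(32, 2))
ema_codebook_update(codebook, cluster_size, embed_sum, z_batch)
```

Contrast with the SOM-VQ rule above: EMA-VQ moves each code independently toward its own assigned inputs, with no grid neighborhood coupling the codes.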
arXiv Detail & Related papers (2025-06-24T22:43:36Z) - A Pure Transformer Pretraining Framework on Text-attributed Graphs [50.833130854272774]
We introduce a feature-centric pretraining perspective by treating graph structure as a prior.
Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks.
GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
arXiv Detail & Related papers (2024-06-19T22:30:08Z) - GrootVL: Tree Topology is All You Need in State Space Model [66.36757400689281]
GrootVL is a versatile multimodal framework that can be applied to both visual and textual tasks.
Our method significantly outperforms existing structured state space models on image classification, object detection and segmentation.
By fine-tuning large language models, our approach achieves consistent improvements in multiple textual tasks at minor training cost.
arXiv Detail & Related papers (2024-06-04T15:09:29Z) - Learning Disentangled Semantic Spaces of Explanations via Invertible Neural Networks [10.880057430629126]
Disentangled latent spaces usually have better semantic separability and geometrical properties, which leads to better interpretability and more controllable data generation.
In this work, we focus on a more general form of sentence disentanglement, targeting the localised modification and control of more general sentence semantic features.
We introduce a flow-based invertible neural network (INN) mechanism integrated with a transformer-based language Autoencoder (AE) in order to deliver latent spaces with better separability properties.
arXiv Detail & Related papers (2023-05-02T18:27:13Z) - Self-Organising Neural Discrete Representation Learning à la Kohonen [42.710124929514066]
We study an alternative Vector Quantisation (VQ) algorithm based on Kohonen's learning rule for the Self-Organising Map (KSOM).
In our experiments, the speed-up compared to well-configured EMA-VQ is only observable at the beginning of training.
arXiv Detail & Related papers (2023-02-15T21:04:04Z) - Multiscale Graph Neural Network Autoencoders for Interpretable Scientific Machine Learning [0.0]
The goal of this work is to address two limitations in autoencoder-based models: latent space interpretability and compatibility with unstructured meshes.
This is accomplished here with the development of a novel graph neural network (GNN) autoencoding architecture with demonstrations on complex fluid flow applications.
arXiv Detail & Related papers (2023-02-13T08:47:11Z) - Graph Adaptive Semantic Transfer for Cross-domain Sentiment Classification [68.06496970320595]
Cross-domain sentiment classification (CDSC) aims to use the transferable semantics learned from the source domain to predict the sentiment of reviews in the unlabeled target domain.
We present the Graph Adaptive Semantic Transfer (GAST) model, an adaptive syntactic graph embedding method that learns domain-invariant semantics from both word sequences and syntactic graphs.
arXiv Detail & Related papers (2022-05-18T07:47:01Z) - Towards Efficient Scene Understanding via Squeeze Reasoning [71.1139549949694]
We propose a novel framework called Squeeze Reasoning.
Instead of propagating information on the spatial map, we first learn to squeeze the input feature into a channel-wise global vector.
We show that our approach can be modularized as an end-to-end trained block and can be easily plugged into existing networks.
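The squeeze step this entry describes, collapsing the spatial map into a channel-wise global vector before reasoning, can be read minimally as global average pooling; a sketch under that assumption (the paper's exact squeeze operator may differ):

```python
import numpy as np

def squeeze_to_channel_vector(feature_map):
    """Collapse a (C, H, W) spatial feature map into a (C,) channel-wise
    global vector by averaging over all spatial positions."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).mean(axis=1)

# Usage: C=2 channels over a 3x4 spatial map.
x = np.arange(24, dtype=float).reshape(2, 3, 4)
v = squeeze_to_channel_vector(x)  # channel means: 5.5 and 17.5
```

Reasoning then happens on the cheap (C,) vector instead of the full (C, H, W) map, which is the efficiency argument the summary makes.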
arXiv Detail & Related papers (2020-11-06T12:17:01Z) - Closed-Form Factorization of Latent Semantics in GANs [65.42778970898534]
A rich set of interpretable dimensions has been shown to emerge in the latent space of the Generative Adversarial Networks (GANs) trained for synthesizing images.
In this work, we examine the internal representation learned by GANs to reveal the underlying variation factors in an unsupervised manner.
We propose a closed-form factorization algorithm for latent semantic discovery by directly decomposing the pre-trained weights.
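A closed-form factorization of pre-trained weights of this kind can be sketched as an eigen-decomposition of W^T W for a latent-projection weight W: the top eigenvectors are the latent directions the layer amplifies most. The function name and single-layer setup below are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def closed_form_directions(weight, k=3):
    """Top-k latent directions from a pre-trained projection weight.

    weight : (out_dim, z_dim) matrix mapping latent z into the network.
    Returns (k, z_dim) unit-norm directions: eigenvectors of W^T W with
    the largest eigenvalues, computed directly from the weights with no
    sampling or supervision.
    """
    eigvals, eigvecs = np.linalg.eigh(weight.T @ weight)  # ascending order
    top = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, top].T

# Usage: a random stand-in for a generator's first projection weight.
rng = np.random.default_rng(2)
W = rng.normal(size=(64, 16))
dirs = closed_form_directions(W, k=3)
```

Because ||W v||^2 equals the eigenvalue for a unit eigenvector v, the returned directions are ordered by how strongly the layer responds to them.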
arXiv Detail & Related papers (2020-07-13T18:05:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.