DVAGen: Dynamic Vocabulary Augmented Generation
- URL: http://arxiv.org/abs/2510.17115v1
- Date: Mon, 20 Oct 2025 03:09:24 GMT
- Title: DVAGen: Dynamic Vocabulary Augmented Generation
- Authors: Wei Du, Nuowei Liu, Jie Wang, Jiahao Kuang, Tao Ji, Xiaoling Wang, Yuanbin Wu,
- Abstract summary: We introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models.<n>Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection.
- Score: 30.433077181096753
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.
Related papers
- Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models [53.06230963851451]
JARVIS is a JEPA-inspired framework for self-supervised visual enhancement in MLLMs.<n>We introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs.
arXiv Detail & Related papers (2025-12-17T19:01:34Z) - Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving [61.992824291296444]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs)<n>This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z) - Scaling LLM Pre-training with Vocabulary Curriculum [0.0]
We introduce vocabulary curriculum learning, an approach that improves pretraining efficiency with log-linear scaling gains relative to vocabulary size.<n>Our method alternates between entropy-guided vocabulary expansion and model optimization, enabling models to learn transferable representations across diverse tokenization granularities.<n>Experiments on small-scale GPT models demonstrate improved scaling efficiency, reinforcing the effectiveness of dynamic tokenization.
arXiv Detail & Related papers (2025-02-25T07:18:29Z) - Chunk-Distilled Language Modeling [25.238256586953487]
Chunk-Distilled Language Modeling (CD-LM) is an approach to text generation that addresses two challenges in current large language models (LLMs)<n>Our method combines deep network-based LLMs with a straightforward retrieval module, which allows the generation of multi-token text chunks at a single decoding step.
arXiv Detail & Related papers (2024-12-31T08:32:15Z) - Adapters for Altering LLM Vocabularies: What Languages Benefit the Most? [23.83290627671739]
VocADT is a novel method for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings.<n>We find that Latin-script languages and highly fragmented languages benefit the most from vocabulary adaptation.
arXiv Detail & Related papers (2024-10-12T20:45:24Z) - DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs)<n>We present a simple yet effective automatic process for creating speech-text pair data.<n>Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning [72.90823351726374]
We introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs.
We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks.
To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures.
arXiv Detail & Related papers (2024-08-06T18:53:54Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose bfAnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency.
We apply two languages to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.