Dialz: A Python Toolkit for Steering Vectors
- URL: http://arxiv.org/abs/2505.06262v2
- Date: Tue, 03 Jun 2025 15:34:01 GMT
- Title: Dialz: A Python Toolkit for Steering Vectors
- Authors: Zara Siddique, Liam D. Turner, Luis Espinosa-Anke
- Abstract summary: We introduce Dialz, a framework for advancing research on steering vectors for open-source LLMs.
Dialz emphasizes modularity and usability, enabling both rapid prototyping and in-depth analysis.
We release Dialz with full documentation, tutorials, and support for popular open-source models.
- Score: 9.734705470760511
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce Dialz, a framework for advancing research on steering vectors for open-source LLMs, implemented in Python. Steering vectors allow users to modify activations at inference time to amplify or weaken a 'concept', e.g. honesty or positivity, providing a more powerful alternative to prompting or fine-tuning. Dialz supports a diverse set of tasks, including creating contrastive pair datasets, computing and applying steering vectors, and visualizations. Unlike existing libraries, Dialz emphasizes modularity and usability, enabling both rapid prototyping and in-depth analysis. We demonstrate how Dialz can be used to reduce harmful outputs such as stereotypes, while also providing insights into model behaviour across different layers. We release Dialz with full documentation, tutorials, and support for popular open-source models to encourage further research in safe and controllable language generation. Dialz enables faster research cycles and facilitates insights into model interpretability, paving the way for safer, more transparent, and more reliable AI systems.
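The workflow the abstract describes, building contrastive pair datasets, computing a steering vector from activation differences, and applying it at inference time, can be sketched in plain NumPy. This is an illustrative stand-in, not Dialz's actual API: `compute_steering_vector` and `apply_steering` are hypothetical names, and the synthetic arrays stand in for hidden states that a real pipeline would record from a transformer layer.

```python
import numpy as np

def compute_steering_vector(pos_acts, neg_acts):
    """Mean-difference steering vector.

    pos_acts / neg_acts: (n_pairs, hidden_dim) activations recorded at a
    chosen layer for the positive (e.g. 'honest') and negative
    (e.g. 'dishonest') half of each contrastive pair.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(hidden, vector, strength=1.0):
    """Shift a hidden state along the steering direction at inference time.

    Positive strength amplifies the concept; negative strength weakens it.
    """
    return hidden + strength * vector

# Synthetic activations standing in for layer outputs on contrastive prompts.
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(8, 4))   # activations for positive prompts
neg = rng.normal(loc=-1.0, size=(8, 4))  # activations for negative prompts

vec = compute_steering_vector(pos, neg)
steered = apply_steering(np.zeros(4), vec, strength=2.0)
```

In a real pipeline the two activation batches would be captured with forward hooks on a chosen transformer layer, and both the sign and magnitude of `strength` are typically swept per layer to trade off concept control against output fluency.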
Related papers
- Scalable and Interpretable Contextual Bandits: A Literature Review and Retail Offer Prototype [2.7624021966289605]
This paper presents a review of Contextual Multi-Armed Bandit (CMAB) methods and introduces an experimental framework for scalable, interpretable offer selection.
The approach models context at the product category level, allowing offers to span multiple categories and enabling knowledge transfer across similar offers.
arXiv Detail & Related papers (2025-05-22T17:13:01Z)
- Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality [74.59049806800176]
This demo paper highlights the Tevatron toolkit's key features, bridging academia and industry.
We showcase a unified dense retriever achieving strong multilingual and multimodal effectiveness.
We also release OmniEmbed, to the best of our knowledge, the first embedding model that unifies text, image document, video, and audio retrieval.
arXiv Detail & Related papers (2025-05-05T08:52:49Z)
- Darkit: A User-Friendly Software Toolkit for Spiking Large Language Model [50.37090759139591]
Large language models (LLMs) have been widely applied in various practical applications, typically comprising billions of parameters.
The human brain, employing bio-plausible spiking mechanisms, can accomplish the same tasks while significantly reducing energy consumption.
We are releasing a software toolkit named DarwinKit (Darkit) to accelerate the adoption of brain-inspired large language models.
arXiv Detail & Related papers (2024-12-20T07:50:08Z)
- LatentQA: Teaching LLMs to Decode Activations Into Natural Language [72.87064562349742]
We introduce LatentQA, the task of answering open-ended questions about model activations in natural language.
We propose Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs.
Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations.
arXiv Detail & Related papers (2024-12-11T18:59:33Z)
- Improving Instruction-Following in Language Models through Activation Steering [58.876600545898675]
We derive instruction-specific vector representations from language models and use them to steer models accordingly.
We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion.
Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
arXiv Detail & Related papers (2024-10-15T08:38:20Z)
- DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models [2.992602379681373]
DeepDecipher is an API and interface for probing neurons in transformer models' layers.
This paper outlines DeepDecipher's design and capabilities.
We demonstrate how to analyze neurons, compare models, and gain insights into model behavior.
arXiv Detail & Related papers (2023-10-03T08:15:20Z)
- pymdp: A Python library for active inference in discrete state spaces [52.85819390191516]
pymdp is an open-source Python package for simulating active inference in discrete state spaces.
To our knowledge, it is the first open-source package for simulating active inference with partially observable Markov decision processes (POMDPs).
arXiv Detail & Related papers (2022-01-11T12:18:44Z)
- GenNI: Human-AI Collaboration for Data-Backed Text Generation [102.08127062293111]
Table2Text systems generate textual output based on structured data utilizing machine learning.
GenNI (Generation Negotiation Interface) is an interactive visual system for high-level human-AI collaboration in producing descriptive text.
arXiv Detail & Related papers (2021-10-19T18:07:07Z)
- EXPATS: A Toolkit for Explainable Automated Text Scoring [2.299617836036273]
We present EXPATS, an open-source framework to allow users to develop and experiment with different ATS models quickly.
The toolkit also provides seamless integration with the Language Interpretability Tool (LIT) so that one can interpret and visualize models and their predictions.
arXiv Detail & Related papers (2021-04-07T19:29:06Z)
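Several of the entries above (Dialz, the activation-steering paper, DeepDecipher) revolve around per-layer interventions and comparing model behaviour across layers. A toy per-layer sweep can be sketched as follows; the `forward` function and its weights are hypothetical illustrations, with the in-loop addition standing in for a framework forward hook, not any of the toolkits' actual APIs.

```python
import numpy as np

def forward(x, layers, steer_at=None, vector=None, strength=1.0):
    """Toy residual-stream forward pass.

    Applies each layer in turn; if steer_at is given, adds the steering
    vector to the hidden state right after that layer -- an illustrative
    stand-in for a forward hook on a real transformer.
    """
    for i, w in enumerate(layers):
        x = np.tanh(w @ x)
        if i == steer_at and vector is not None:
            x = x + strength * vector
    return x

rng = np.random.default_rng(1)
layers = [rng.normal(scale=0.5, size=(4, 4)) for _ in range(3)]
x0 = rng.normal(size=4)
vec = np.array([1.0, 0.0, 0.0, 0.0])

baseline = forward(x0, layers)
# Sweep the injection layer to see where the intervention moves the output most.
effects = [np.linalg.norm(forward(x0, layers, steer_at=i, vector=vec) - baseline)
           for i in range(len(layers))]
```

Plotting `effects` against the layer index is the kind of layer-wise comparison these toolkits automate: early-layer injections get reshaped by the remaining layers, while an injection after the final layer shifts the output by exactly the steering vector.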
This list is automatically generated from the titles and abstracts of the papers on this site.