TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability
- URL: http://arxiv.org/abs/2507.19419v2
- Date: Tue, 30 Sep 2025 18:21:41 GMT
- Title: TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability
- Authors: Mohammad Aflah Khan, Ameya Godbole, Johnny Tian-Zheng Wei, Ryan Wang, James Flemings, Krishna P. Gummadi, Willie Neiswanger, Robin Jia
- Abstract summary: TokenSmith is an open-source library for interactive editing, inspection, and analysis of datasets. It supports datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith is hosted on GitHub, with accompanying documentation, tutorials, and a demonstration video.
- Score: 39.43508569004967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith supports a wide range of operations including searching, viewing, ingesting, exporting, inspecting, and sampling data, all accessible through a simple user interface and a modular backend. It also enables structured editing of pretraining data without requiring changes to training code, simplifying dataset debugging, validation, and experimentation. TokenSmith is designed as a plug-and-play addition to existing large language model pretraining workflows, thereby democratizing access to production-grade dataset tooling. TokenSmith is hosted on GitHub, with accompanying documentation, tutorials, and a demonstration video (available on YouTube).
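The abstract describes searching, inspecting, and sampling over Megatron-style tokenized datasets without touching training code. The sketch below illustrates the general idea behind such inspection: memory-mapping a flat binary shard of token ids so that windows can be sampled without loading the whole file. The function names (`write_token_shard`, `sample_window`) are illustrative assumptions for this sketch, not TokenSmith's actual API.

```python
import os
import tempfile

import numpy as np

# Hypothetical helpers sketching memory-mapped shard inspection;
# these are NOT TokenSmith's API, just the underlying idea.

def write_token_shard(path: str, tokens, dtype=np.uint16) -> None:
    """Write a flat binary shard of token ids, the layout Megatron-style loaders use."""
    np.asarray(list(tokens), dtype=dtype).tofile(path)

def sample_window(path: str, start: int, length: int, dtype=np.uint16) -> list:
    """Memory-map the shard so multi-gigabyte files are never loaded wholesale."""
    data = np.memmap(path, dtype=dtype, mode="r")
    return data[start:start + length].tolist()

shard = os.path.join(tempfile.mkdtemp(), "shard.bin")
write_token_shard(shard, range(100))
print(sample_window(shard, 10, 5))  # → [10, 11, 12, 13, 14]
```

Memory-mapping is what makes interactive viewing of production-scale pretraining data practical: the OS pages in only the bytes a sampled window touches.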
Related papers
- Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance [55.32799307123252]
We introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets. We propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance.
arXiv Detail & Related papers (2026-03-02T18:46:28Z) - MLPrE -- A tool for preprocessing and exploratory data analysis prior to machine learning model construction [0.24629531282150877]
We present Machine Learning Preprocessing and Exploratory data analysis (MLPrE). DataFrames are used to hold data during processing and ensure scalability. A total of 69 stages were implemented in MLPrE, of which we highlight and demonstrate key stages using six diverse datasets.
arXiv Detail & Related papers (2025-10-29T17:52:39Z) - PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding [126.15907330726067]
We study building a Perception Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps. To bridge these gaps, we release PLM-VideoBench, a suite for evaluating challenging video understanding tasks.
arXiv Detail & Related papers (2025-04-17T17:59:56Z) - Deep Fast Machine Learning Utils: A Python Library for Streamlined Machine Learning Prototyping [0.0]
The Deep Fast Machine Learning Utils (DFMLU) library provides tools designed to automate and enhance aspects of machine learning processes.
DFMLU offers functionalities that support model development and data handling.
This manuscript presents an overview of DFMLU's functionalities, providing Python examples for each tool.
arXiv Detail & Related papers (2024-09-14T21:39:17Z) - An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z) - TrueLearn: A Python Library for Personalised Informational Recommendations with (Implicit) Feedback [4.575111313202425]
This work describes the TrueLearn Python library, which contains a family of online learning Bayesian models.
For the sake of interpretability and putting the user in control, the TrueLearn library also contains different representations to help end-users visualise the learner models.
arXiv Detail & Related papers (2023-09-20T07:21:50Z) - Architecture, Dataset and Model-Scale Agnostic Data-free Meta-Learning [117.48444197402858]
We propose ePisode cUrriculum inveRsion (ECI) during data-free meta training and invErsion calibRation following inner loop (ICFIL) during meta testing. ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model. We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner.
arXiv Detail & Related papers (2023-03-20T15:10:41Z) - Retrieval as Attention: End-to-end Learning of Retrieval and Reading
within a Single Transformer [80.50327229467993]
We show that a single model trained end-to-end can achieve both competitive retrieval and QA performance.
We show that end-to-end adaptation significantly boosts its performance on out-of-domain datasets in both supervised and unsupervised settings.
arXiv Detail & Related papers (2022-12-05T04:51:21Z) - Cross-Modal Adapter for Vision-Language Retrieval [60.59577149733934]
We present a novel Cross-Modal Adapter for parameter-efficient transfer learning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. Our approach has three notable benefits: (1) reduces the vast majority of fine-tuned parameters, (2) saves training time, and (3) allows all the pre-trained parameters to be fixed.
arXiv Detail & Related papers (2022-11-17T16:15:30Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches do not supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems. In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - AdapterHub Playground: Simple and Flexible Few-Shot Learning with Adapters [34.86139827292556]
Open-access dissemination of pretrained language models has led to a democratization of state-of-the-art natural language processing (NLP) research.
This also allows people outside of NLP to use such models and adapt them to specific use-cases. However, doing so still typically requires programming skill. In this work, we aim to overcome this gap by providing a tool which allows researchers to leverage pretrained models without writing a single line of code.
arXiv Detail & Related papers (2021-08-18T11:56:01Z) - LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis [3.4253416336476246]
This paper introduces layoutparser, an open-source library for streamlining the usage of deep learning (DL) models in document image analysis (DIA) research and applications.
layoutparser comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout detection, character recognition, and many other document processing tasks.
We demonstrate that layoutparser is helpful for both lightweight and large-scale pipelines in real-world use cases.
arXiv Detail & Related papers (2021-03-29T05:55:08Z) - Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z) - ktrain: A Low-Code Library for Augmented Machine Learning [0.0]
ktrain is a low-code Python library that makes machine learning more accessible and easier to apply.
It is designed to make sophisticated, state-of-the-art machine learning models simple to build, train, inspect, and apply by both beginners and experienced practitioners.
arXiv Detail & Related papers (2020-04-19T14:18:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.