NeMo-Inspector: A Visualization Tool for LLM Generation Analysis
- URL: http://arxiv.org/abs/2505.00903v1
- Date: Thu, 01 May 2025 22:47:06 GMT
- Title: NeMo-Inspector: A Visualization Tool for LLM Generation Analysis
- Authors: Daria Gitman, Igor Gitman, Evelina Bakhturina,
- Abstract summary: We introduce NeMo-Inspector, an open-source tool designed to simplify the analysis of synthetic datasets with integrated inference capabilities.<n>We demonstrate its effectiveness through two real-world cases.
- Score: 6.55530159050218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adapting Large Language Models (LLMs) to novel tasks and enhancing their overall capabilities often requires large, high-quality training datasets. Synthetic data, generated at scale, serves a valuable alternative when real-world data is scarce or difficult to obtain. However, ensuring the quality of synthetic datasets is challenging, as developers must manually inspect and refine numerous samples to identify errors and areas for improvement. This process is time-consuming and requires specialized tools. We introduce NeMo-Inspector, an open-source tool designed to simplify the analysis of synthetic datasets with integrated inference capabilities. We demonstrate its effectiveness through two real-world cases. Analysis and cleaning of the synthetically generated GSM-Plus dataset with NeMo-Inspector led to a significant decrease in low-quality samples from 46.99% to 19.51%. The tool also helped identify and correct generation errors in OpenMath models, improving accuracy by 1.92% on the MATH dataset and by 4.17% on the GSM8K dataset for a Meta-Llama-3-8B model fine-tuned on synthetic data generated from Nemotron-4-340B.
Related papers
- Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios [36.938892077684635]
Crucial-Diff is a domain-agnostic framework designed to synthesize crucial samples.<n>Our framework generates diverse, high-quality training data, achieving a pixel-level AP of 83.63% and an F1-MAX of 78.12% on MVTec.
arXiv Detail & Related papers (2025-07-14T04:41:38Z) - Improving Model Alignment Through Collective Intelligence of Open-Source LLMS [34.23134719050941]
We introduce MoAA, that leverages the collective strengths of various language models to provide high-quality data for model alignment.<n> evaluation results show that our approach can improve win rate of LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval2.
arXiv Detail & Related papers (2025-05-05T22:40:23Z) - SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [10.70881967278009]
We present SWE- Synth, a framework for synthesizing realistic verifiable, and process-aware bug-fix datasets at the repository level.<n>Compared to manually curated datasets, our method scales with minimal human effort while preserving contextual richness and correctness.<n>Our results highlight the potential of synthetic, agent-generated data to advance the state of the art in APR and software engineering automation.
arXiv Detail & Related papers (2025-04-20T22:37:43Z) - Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets.<n>Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z) - Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers [0.0]
This paper presents a machine learning framework that automates dataset mention detection across research domains.<n>We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset.<n>At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall.
arXiv Detail & Related papers (2025-02-14T16:16:02Z) - Evaluating Language Models as Synthetic Data Generators [74.80905172696366]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities.<n>Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z) - TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction empowering LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and
the Case of Information Extraction [28.51694365908817]
This work shows that useful data can be synthetically generated even for tasks that cannot be solved directly by large language models.
We synthetically generate a dataset of 1.8M data points, establish its superior quality compared to existing datasets in a human evaluation.
arXiv Detail & Related papers (2023-03-07T18:48:55Z) - ZeroGen$^+$: Self-Guided High-Quality Data Generation in Efficient
Zero-Shot Learning [97.2907428983142]
ZeroGen attempts to purely use PLM to generate data and train a tiny model without relying on task-specific annotation.
We propose a noise-robust bi-level re-weighting framework which is able to learn the per-sample weights measuring the data quality without requiring any gold data.
arXiv Detail & Related papers (2022-05-25T11:38:48Z) - Fix your Models by Fixing your Datasets [0.6058427379240697]
Current machine learning tools lack streamlined processes for improving the data quality.
We introduce a systematic framework for finding noisy or mislabelled samples in the dataset.
We demonstrate the efficacy of our framework on public as well as private enterprise datasets of two Fortune 500 companies.
arXiv Detail & Related papers (2021-12-15T02:41:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.