Related papers: Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation

Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation

URL: http://arxiv.org/abs/2505.12058v1
Date: Sat, 17 May 2025 15:40:03 GMT
Title: Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation
Authors: Vincent Koc,
Abstract summary: Tiny QA Benchmark++ (TQB++) is designed to give large-language-model (LLM) pipelines a unit-test style safety net dataset that runs in seconds with minimal cost.<n>TQB++ couples a 52-item English gold set with a tiny synthetic-data generator pypi package built on provider-agnostic LiteLLM.<n>Every dataset ships with Croissant metadata and plug-and-play files for OpenAI-Evals, LangChain, and standard CI tools.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tiny QA Benchmark++ (TQB++) presents an ultra-lightweight, multilingual smoke-test suite designed to give large-language-model (LLM) pipelines a unit-test style safety net dataset that runs in seconds with minimal cost. Born out of the tight feedback-loop demands building the Comet Opik prompt-optimization SDK, where waiting on heavyweight benchmarks breaks developer flow. TQB++ couples a 52-item English gold set (less than 20 kB) with a tiny synthetic-data generator pypi package built on provider-agnostic LiteLLM. The generator lets practitioners mint their own tiny packs in any language, domain, or difficulty, while ten ready-made packs already cover Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish. Every dataset ships with Croissant metadata and plug-and-play files for OpenAI-Evals, LangChain, and standard CI tools, so teams can drop deterministic micro-benchmarks directly into pull-request gates, prompt-engineering loops, and production dashboards without touching GPU budgets. A complete TQB++ run adds only a few seconds to pipeline latency yet reliably flags prompt-template errors, tokenizer drift, and fine-tuning side-effects long before full-scale suites like MMLU or BIG-Bench would finish configuring. The entire framework is released to accelerate continuous, resource-efficient quality assurance across the generative-AI ecosystem.

Related papers

Qwen3-TTS Technical Report [64.94647392030824]
We present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models.<n>Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control.<n>Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers.
arXiv Detail & Related papers (2026-01-22T03:51:43Z)
VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents [42.56489784841984]
"fully generated" refers to code provenance: implementation changes were produced and applied as agent-proposed diffs.<n>We describe the architecture, summarize the workflow used to produce and validate the system, and evaluate the artifact.
arXiv Detail & Related papers (2026-01-21T19:29:00Z)
How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation [1.3288901827225499]
We benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025.<n>Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput)<n>Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off,
arXiv Detail & Related papers (2025-11-12T21:25:50Z)
dParallel: Learnable Parallel Decoding for dLLMs [77.24184219948337]
Diffusion large language models (dLLMs) offer parallel token prediction and lower inference latency.<n>Existing open-source models still require nearly token-length decoding steps to ensure performance.<n>We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling.
arXiv Detail & Related papers (2025-09-30T16:32:52Z)
Apple Intelligence Foundation Language Models: Tech Report 2025 [246.04717786298764]
We introduce two foundation language models that power Apple Intelligence features across Apple devices and services.<n>Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling.<n>A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning.
arXiv Detail & Related papers (2025-07-17T23:37:19Z)
Mercury: Ultra-Fast Language Models Based on Diffusion [58.52391675075641]
We present Mercury, a new generation of commercial-scale large language models (LLMs) based on diffusion.<n>Mercury Coder comes in two sizes: Mini and Small.<n>Based on independent evaluations, Mercury Coder Mini and Mercury Coder Small achieve state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec, respectively.
arXiv Detail & Related papers (2025-06-17T17:06:18Z)
SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs)<n>Unlike traditional static benchmarks, SwingArena models the collaborative process of software by pairing LLMs as iterations, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z)
ZeroML: A Next Generation AutoML Language [0.0]
ZeroML is a new generation programming language for AutoML to drive the ML pipeline in a compiled and multi-paradigm way.<n>ZeroML brings the Microservices-based architecture adding the modular, reusable pieces such as DataCleaner, FeatureEngineer or ModelSelector.
arXiv Detail & Related papers (2025-05-23T16:01:49Z)
RAG-Based Fuzzing of Cross-Architecture Compilers [0.8302146576157498]
OneAPI is an open standard that supports cross-architecture software development with minimal effort from developers.<n>OneAPI brings DPC++ and C++ compilers which need to be thoroughly tested to verify their correctness, reliability, and security.<n>This paper proposes a large-language model (LLM)-based compiler fuzzing tool that integrates the concept of retrieval-augmented generation (RAG)
arXiv Detail & Related papers (2025-04-11T20:46:52Z)
Vision-centric Token Compression in Large Language Model [51.92055188780033]
Vision Centric Token Compression (Vist) is a slow-fast compression framework that mirrors human reading.<n>On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%.
arXiv Detail & Related papers (2025-02-02T13:10:06Z)
FuzzWiz -- Fuzzing Framework for Efficient Hardware Coverage [2.1626093085892144]
We create an automated hardware fuzzing framework called FuzzWiz. It includes parsing the RTL design module, converting it into C/C++ models, creating generic testbench with assertions, linking, and fuzzing. Our benchmarking results show that we could achieve around 90% of the coverage 10 times faster than traditional simulation regression based approach.
arXiv Detail & Related papers (2024-10-23T10:06:08Z)
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution [87.3259169631789]
Nearest Speculative Decoding (NEST) is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources.<n>NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks.<n>In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.
arXiv Detail & Related papers (2024-05-29T17:55:03Z)
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. We propose a novel parallel prompt decoding that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Our approach demonstrates up to 2.49$times$ speedup and maintains a minimal memory overhead of just $0.0004$%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
Automated Multi-Language to English Machine Translation Using Generative Pre-Trained Transformers [0.8192907805418583]
This study examines using local Generative Pretrained Transformer (GPT) models to perform automated zero shot black-box, sentence wise, multi-natural-language translation into English text. We benchmark 16 different open-source GPT models, with no custom fine-tuning, from the Huggingface LLM repository for translating 50 different non-English languages into English. Benchmark metrics that are reported are language translation accuracy, using BLEU, GLEU, METEOR, and chrF text overlap measures, and wall-clock time for each sentence translation.
arXiv Detail & Related papers (2024-04-23T02:19:35Z)
Adapting Language Models to Compress Contexts [71.98287002918941]
Transformer-based language models (LMs) are powerful and widely-applicable tools, but their usefulness is constrained by a finite context window. We propose to adapt pre-trained LMs into AutoCompressors, which are capable of compressing long contexts into compact summary vectors. We fine-tune OPT and Llama-2 models on sequences of up to 30,720 tokens and show that AutoCompressors can utilize long contexts to improve perplexity.
arXiv Detail & Related papers (2023-05-24T06:42:44Z)
Chinese Open Instruction Generalist: A Preliminary Release [33.81265396916227]
We propose the project as an attempt to create a Chinese instruction dataset by various methods adapted to the intrinsic characteristics of 4 sub-tasks. We collect around 200k Chinese instruction tuning samples, which have been manually checked to guarantee high quality. We summarize the existing English and Chinese instruction corpora and briefly describe some potential applications of the newly constructed Chinese instruction corpora.
arXiv Detail & Related papers (2023-04-17T04:45:06Z)
Efficient Inference for Multilingual Neural Machine Translation [60.10996883354372]
We consider several ways to make multilingual NMT faster at inference without degrading its quality. Our experiments demonstrate that combining a shallow decoder with vocabulary filtering leads to more than twice faster inference with no loss in translation quality.
arXiv Detail & Related papers (2021-09-14T13:28:13Z)
MOROCCO: Model Resource Comparison Framework [61.444083353087294]
We present MOROCCO, a framework to compare language models compatible with ttjiant environment which supports over 50 NLU tasks. We demonstrate its applicability for two GLUE-like suites in different languages.
arXiv Detail & Related papers (2021-04-29T13:01:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.