Evaluating LLMs and Prompting Strategies for Automated Hardware Diagnosis from Textual User-Reports
- URL: http://arxiv.org/abs/2507.00742v1
- Date: Tue, 01 Jul 2025 13:46:00 GMT
- Title: Evaluating LLMs and Prompting Strategies for Automated Hardware Diagnosis from Textual User-Reports
- Authors: Carlos Caminha, Maria de Lourdes M. Silva, Iago C. Chaves, Felipe T. Brito, Victor A. E. Farias, Javam C. Machado
- Abstract summary: Large Language Models (LLMs) have shown promise in addressing such issues. This study evaluates 27 open-source models (1B-72B parameters) and 2 proprietary LLMs using four prompting strategies. Three models offer the best balance between size and performance: mistral-small-24b-instruct and two smaller models, llama-3.2-1b-instruct and gemma-2-2b-it.
- Score: 0.43981305860983716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computer manufacturers offer platforms for users to describe device faults using textual reports such as "My screen is flickering". Identifying the faulty component from the report is essential for automating tests and improving user experience. However, such reports are often ambiguous and lack detail, making this task challenging. Large Language Models (LLMs) have shown promise in addressing such issues. This study evaluates 27 open-source models (1B-72B parameters) and 2 proprietary LLMs using four prompting strategies: Zero-Shot, Few-Shot, Chain-of-Thought (CoT), and CoT+Few-Shot (CoT+FS). We conducted 98,948 inferences, processing over 51 million input tokens and generating 13 million output tokens. We achieve F1-scores of up to 0.76. Results show that three models offer the best balance between size and performance: mistral-small-24b-instruct and two smaller models, llama-3.2-1b-instruct and gemma-2-2b-it, which offer competitive performance with lower VRAM usage, enabling efficient inference on end-user devices such as modern laptops or smartphones with NPUs.
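The four prompting strategies named in the abstract can be sketched as prompt-template builders. The following is a minimal Python illustration; the component labels, few-shot examples, and prompt wording are hypothetical assumptions, not the authors' actual templates:

```python
# Hypothetical sketch of the four prompting strategies evaluated in the paper
# (Zero-Shot, Few-Shot, CoT, CoT+FS) for mapping a textual user report to a
# faulty hardware component. Labels, examples, and wording are illustrative.

COMPONENTS = ["screen", "battery", "keyboard", "speaker"]

# Illustrative few-shot demonstrations (report text, gold label).
FEW_SHOT_EXAMPLES = [
    ("My laptop dies after ten minutes unplugged.", "battery"),
    ("Some keys type the wrong letters.", "keyboard"),
]

def build_prompt(report: str, strategy: str) -> str:
    """Assemble a classification prompt for one of the four strategies:
    'zero_shot', 'few_shot', 'cot', or 'cot_fs'."""
    task = (
        "Identify the faulty hardware component described in the user report. "
        f"Answer with one of: {', '.join(COMPONENTS)}."
    )
    parts = [task]
    # Few-Shot and CoT+FS prepend labeled demonstrations.
    if strategy in ("few_shot", "cot_fs"):
        for text, label in FEW_SHOT_EXAMPLES:
            parts.append(f'Report: "{text}"\nComponent: {label}')
    # CoT and CoT+FS ask the model to reason before answering.
    if strategy in ("cot", "cot_fs"):
        parts.append("Think step by step before giving the final component.")
    parts.append(f'Report: "{report}"\nComponent:')
    return "\n\n".join(parts)

if __name__ == "__main__":
    for s in ("zero_shot", "few_shot", "cot", "cot_fs"):
        print(f"--- {s} ---")
        print(build_prompt("My screen is flickering", s))
```

In an evaluation like the paper's, each (report, strategy) pair would be sent to every model under test and the predicted component compared against a gold label to compute the F1-score.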
Related papers
- LFM2 Technical Report [87.58431408281973]
We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active). We build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval.
arXiv Detail & Related papers (2025-11-28T17:56:35Z) - One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework [51.50565654314582]
Large language models can follow users' instructions throughout a dialogue spanning multiple topics. Existing benchmarks are often limited to a fixed number of turns, making them susceptible to saturation and failing to account for the user's interactive experience. We propose a framework for assessing multi-turn instruction-following ability.
arXiv Detail & Related papers (2025-11-05T14:39:59Z) - David vs. Goliath: A comparative study of different-sized LLMs for code generation in the domain of automotive scenario generation [1.6752458252726459]
NL-to-Scenic generation with large language models (LLMs) suffers from scarce data and limited metrics. We introduce NL2Scenic, an open dataset and framework with 146 NL/Scenic pairs, a difficulty-stratified 30-case test split, and an Example Retriever. We evaluate 13 models: four proprietary (GPT-4o, GPT-5, Claude-Sonnet-4, Gemini-2.5-pro) and nine open-source code models (Qwen2.5Coder 0.5B-32B; CodeLlama 7B/13B/34B).
arXiv Detail & Related papers (2025-10-15T21:37:02Z) - Multi-Intent Recognition in Dialogue Understanding: A Comparison Between Smaller Open-Source LLMs [6.6893292050680655]
We use the MultiWOZ 2.1 dataset, a benchmark in the dialogue system domain, to investigate the efficacy of three popular open-source pre-trained LLMs. We perform the classification task in a few-shot setup, giving 20 examples in the prompt with some instructions. We find that Mistral-7B-v0.1 outperforms the two other generative models on 11 of 14 intent classes in terms of F-score.
arXiv Detail & Related papers (2025-09-12T07:10:55Z) - Ensuring Reproducibility in Generative AI Systems for General Use Cases: A Framework for Regression Testing and Open Datasets [0.0]
We introduce GPR-bench, a benchmark that operationalizes regression testing for general-purpose use cases. We show that newer models generally improve correctness, but the differences are modest and not statistically significant. In contrast, the concise-writing instruction significantly enhances conciseness, demonstrating the effectiveness of prompt engineering.
arXiv Detail & Related papers (2025-05-02T12:31:43Z) - Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs [0.28207011158655404]
This paper defines and explores the design space for information extraction from layout-rich documents using large language models (LLMs). Our study delves into the sub-problems within these core challenges, such as input representation, chunking, prompting, and selection of LLMs and multimodal models. It examines the outcomes of different design choices through a new layout-aware IE test suite, benchmarking against the state-of-the-art (SoA) model LayoutLMv3.
arXiv Detail & Related papers (2025-02-25T13:11:53Z) - AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results [55.33807002543901]
We present AIvaluateXR, a comprehensive evaluation framework for benchmarking large language models (LLMs) running on XR devices. We deploy 17 selected LLMs across four XR platforms: Magic Leap 2, Meta Quest 3, Vivo X100s Pro, and Apple Vision Pro, and conduct an extensive evaluation. We propose a unified evaluation method based on the 3D Optimality theory to select the optimal device-model pairs from quality and speed objectives.
arXiv Detail & Related papers (2025-02-13T20:55:48Z) - Reasoning Robustness of LLMs to Adversarial Typographical Errors [49.99118660264703]
Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning using Chain-of-Thought (CoT) prompting.
We study the reasoning robustness of LLMs to typographical errors, which can naturally occur in users' queries.
We design an Adversarial Typo Attack ($\texttt{ATA}$) algorithm that iteratively samples typos for words that are important to the query and selects the edit that is most likely to succeed in attacking.
arXiv Detail & Related papers (2024-11-08T05:54:05Z) - StructuredRAG: JSON Response Formatting with Large Language Models [0.3141085922386211]
We introduce StructuredRAG, a benchmark of six tasks designed to assess Large Language Models' proficiency in following response format instructions.
We evaluate two state-of-the-art LLMs, Gemini 1.5 Pro and Llama 3 8B-instruct with 4-bit quantization.
We find that Llama 3 8B-instruct often performs competitively with Gemini 1.5 Pro.
arXiv Detail & Related papers (2024-08-07T19:32:59Z) - How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, Gemini-Pro, to open-sourced models, such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z) - FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
This paper shows how to reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall-clock time speedup on different hardware with a negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z) - Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z) - Quantized Transformer Language Model Implementations on Edge Devices [1.2979415757860164]
Large-scale transformer-based models like the Bidirectional Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications.
These models are initially pre-trained with a large corpus with millions of parameters and then fine-tuned for a downstream NLP task.
One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency.
arXiv Detail & Related papers (2023-10-06T01:59:19Z) - T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation [55.16845189272573]
T2I-CompBench++ is an enhanced benchmark for compositional text-to-image generation. It comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions.
arXiv Detail & Related papers (2023-07-12T17:59:42Z) - Dual Learning for Large Vocabulary On-Device ASR [64.10124092250128]
Dual learning is a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks at once.
We provide an analysis of an on-device-sized streaming conformer trained on the entirety of Librispeech, showing relative WER improvements of 10.7%/5.2% without an LM and 11.7%/16.4% with an LM.
arXiv Detail & Related papers (2023-01-11T06:32:28Z) - S3M: Siamese Stack (Trace) Similarity Measure [55.58269472099399]
We present S3M -- the first approach to computing stack trace similarity based on deep learning.
It is based on a biLSTM encoder and a fully-connected classifier to compute similarity.
Our experiments demonstrate the superiority of our approach over the state-of-the-art on both open-sourced data and a private JetBrains dataset.
arXiv Detail & Related papers (2021-03-18T21:10:41Z) - LiteMuL: A Lightweight On-Device Sequence Tagger using Multi-task Learning [1.3192560874022086]
LiteMuL is a lightweight on-device sequence tagger that can efficiently process user conversations using a Multi-Task Learning approach.
Our model is competitive with other MTL approaches for NER and POS tasks while outperforming them with a lower memory footprint.
arXiv Detail & Related papers (2020-12-15T19:15:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.