PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries
- URL: http://arxiv.org/abs/2506.18728v2
- Date: Thu, 26 Jun 2025 16:35:54 GMT
- Title: PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries
- Authors: Steven Kolawole, Keshav Santhanam, Virginia Smith, Pratiksha Thaker
- Abstract summary: We introduce PARALLELPROMPT, the first benchmark for measuring intra-query parallelism in natural user prompts. Our dataset comprises over 37,000 real-world prompts from public LLM chat logs. We provide an execution suite that benchmarks serial vs. parallel strategies, measuring latency, structural adherence, and semantic fidelity.
- Score: 16.40921376558516
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM serving systems typically treat user prompts as monolithic inputs, optimizing inference through decoding tricks or inter-query batching. However, many real-world prompts contain latent semantic parallelism--decomposable structures where subtasks can be executed independently to reduce latency while preserving meaning. We introduce PARALLELPROMPT, the first benchmark for measuring intra-query parallelism in natural user prompts. Our dataset comprises over 37,000 real-world prompts from public LLM chat logs, each annotated with a structured schema capturing task templates, shared context, and iteration inputs. These schemas are extracted using LLM-assisted prompting with rule-based multilingual validation. To evaluate the benefits of decomposition, we provide an execution suite that benchmarks serial vs. parallel strategies, measuring latency, structural adherence, and semantic fidelity. Our results show that intra-query parallelism can be successfully parsed in over 75% of curated datasets, unlocking up to 5x speedups on tasks like translation, comprehension, and comparative analysis, with minimal quality degradation. By releasing this benchmark, curation pipeline, and evaluation suite, we provide the first standardized testbed for studying structure-aware execution in LLM serving pipelines.
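To make the schema concrete: each entry pairs a task template with shared context and a list of iteration inputs, and the parallel strategy instantiates the template once per input and issues the subqueries concurrently. Below is a minimal Python sketch of that fan-out, assuming an async LLM client; the schema keys, the `complete` stub, and `run_parallel` are hypothetical illustrations, not the benchmark's actual API.

```python
import asyncio

async def complete(prompt: str) -> str:
    """Stand-in for a real LLM API call (hypothetical; e.g., an async HTTP request)."""
    await asyncio.sleep(0.1)  # simulate network + decoding latency
    return f"<answer to: {prompt[:40]}...>"

async def run_parallel(schema: dict) -> list[str]:
    # Instantiate the task template once per iteration input, reusing the
    # shared context, then dispatch all subqueries concurrently.
    prompts = [
        schema["template"].format(context=schema["context"], item=item)
        for item in schema["inputs"]
    ]
    return await asyncio.gather(*(complete(p) for p in prompts))

# Hypothetical schema in the shape the abstract describes:
# a task template, shared context, and iteration inputs.
schema = {
    "template": "{context}\nTranslate into French: {item}",
    "context": "You are a professional English-to-French translator.",
    "inputs": ["The sky is blue.", "Time flies.", "Knowledge is power."],
}
answers = asyncio.run(run_parallel(schema))  # wall-clock ~ one call, not three
```

Because the subqueries are independent, wall-clock latency approaches that of a single call rather than the sum over all inputs, which is where speedups like the reported up-to-5x come from.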
Related papers
- A Semantic Parsing Framework for End-to-End Time Normalization [10.472379345636845]
Time normalization is the task of converting natural language temporal expressions into machine-readable representations. Traditional systems based on the ISO-TimeML schema limit expressivity. We introduce a novel formulation of time normalization as a code generation task grounded in the SCATE framework.
arXiv Detail & Related papers (2025-07-08T23:30:11Z) - Large Language Models are Good Relational Learners [55.40941576497973]
We introduce Rel-LLM, a novel architecture that utilizes a graph neural network (GNN)-based encoder to generate structured relational prompts for large language models (LLMs). Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to process and reason over complex entity relationships.
arXiv Detail & Related papers (2025-06-06T04:07:55Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark for evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing the agent's final numerical output to a human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on only 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - LLMSR@XLLM25: An Empirical Study of LLM for Structural Reasoning [6.700515856842664]
We present Team asdfo123's submission to the LLMSR@XLLM25 shared task. We evaluate large language models on producing fine-grained, controllable, and interpretable reasoning processes. Our method ranks 5th overall, achieving macro F1 scores on par with substantially more complex and resource-intensive pipelines.
arXiv Detail & Related papers (2025-05-18T09:46:30Z) - Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding [26.571743941748238]
PASTA is a learning-based system that teaches large language models to identify semantic independence and express parallel decoding opportunities in their own responses. It does so via PASTA-Lang, an annotation language through which LLMs mark semantically independent spans in their outputs. Our results demonstrate geometric mean speedups ranging from 1.21x to 1.93x, with corresponding quality changes of +2.2% to -7.1%, measured by length-controlled win rates against a sequential decoding baseline.
arXiv Detail & Related papers (2025-02-17T07:39:16Z) - LLM-AutoDiff: Auto-Differentiate Any LLM Workflow [58.56731133392544]
We introduce LLM-AutoDiff, a novel framework for Automatic Prompt Engineering (APE). LLM-AutoDiff treats each textual input as a trainable parameter and uses a frozen backward engine to generate feedback akin to textual gradients. It consistently outperforms existing textual-gradient baselines in both accuracy and training cost.
arXiv Detail & Related papers (2025-01-28T03:18:48Z) - DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing [10.712756715779822]
Large Language Models (LLMs) have shown promise in data processing, and existing frameworks for LLM-based data processing focus on reducing cost when executing user-specified operations. This cost-first focus is problematic for complex tasks and data. We present DocETL, a system that optimizes complex document processing pipelines.
arXiv Detail & Related papers (2024-10-16T03:22:35Z) - ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models [46.07900122810749]
Large language models (LLMs) have achieved unprecedented performance in various applications, yet evaluating them is still challenging.
We contend that utilizing existing relational databases is a promising approach for constructing benchmarks.
We propose ERBench, which uses the integrity constraints of relational databases to convert any database into an LLM benchmark.
arXiv Detail & Related papers (2024-03-08T12:42:36Z) - PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking them at the sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion [96.47420221442397]
We introduce the PowerPoint Task Completion benchmark to assess the ability of Large Language Models to complete multi-turn, multi-modal instructions.
We also propose the PPTX-Match Evaluation System, which judges whether an LLM has completed an instruction based on the prediction file rather than the labeled API sequence.
The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy.
arXiv Detail & Related papers (2023-11-03T08:06:35Z) - Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding [101.24748444126982]
Decomposable tasks are complex and comprise a hierarchy of sub-tasks.
Existing benchmarks, however, typically hold out examples for only the surface-level sub-task.
We propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions.
arXiv Detail & Related papers (2021-06-29T02:53:59Z)