StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following
- URL: http://arxiv.org/abs/2502.14494v1
- Date: Thu, 20 Feb 2025 12:22:18 GMT
- Title: StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following
- Authors: Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu
- Abstract summary: Multi-turn instruction following capability constitutes a core competency of large language models.
We propose StructFlowBench, a multi-turn instruction following benchmark with structural flow modeling.
- Abstract: Multi-turn instruction following capability constitutes a core competency of large language models (LLMs) in real-world applications. Existing evaluation benchmarks predominantly focus on fine-grained constraint satisfaction and domain-specific capability assessment, yet overlook the crucial structural dependency between dialogue turns that distinguishes multi-turn from single-turn interactions. This structural dependency not only reflects user intent but also establishes a second dimension for instruction following evaluation beyond constraint satisfaction. To address this gap, we propose StructFlowBench, a multi-turn instruction following benchmark with structural flow modeling. The benchmark innovatively defines a structural flow framework comprising six fundamental inter-turn relationships, which not only introduces novel structural constraints for model evaluation but also serves as generation parameters for creating customized dialogue flows tailored to specific scenarios. Adopting established LLM-based automatic evaluation methodologies, we conduct systematic evaluations of 13 leading open-source and closed-source LLMs. Experimental results reveal significant deficiencies in current models' comprehension of multi-turn dialogue structures. The code is available at \url{https://github.com/MLGroupJLU/StructFlowBench}.
Related papers
- Neural Contextual Reinforcement Framework for Logical Structure Language Generation [1.08272575635683]
The framework integrates custom reward functions and dynamic context alignment mechanisms.
It produces outputs that align closely with human expectations of logical structure and semantic flow.
It exhibits robustness in handling noisy input data and scalability across varying model sizes.
arXiv Detail & Related papers (2025-01-20T11:34:28Z)
- StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models on their ability to produce structured outputs.
We demonstrate that StructTest serves as a good proxy for general reasoning abilities.
arXiv Detail & Related papers (2024-12-23T22:08:40Z)
- DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models [62.98273649512654]
Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks.
Increased memory and computational costs associated with these models pose significant challenges for deployment on resource-limited devices.
We propose a novel approach that relaxes the constraint imposed by regular structural pruning methods.
arXiv Detail & Related papers (2024-10-15T18:51:18Z)
- A Large Language Model and Denoising Diffusion Framework for Targeted Design of Microstructures with Commands in Natural Language [0.0]
We propose a framework that integrates Natural Language Processing (NLP), Large Language Models (LLMs), and Denoising Diffusion Probabilistic Models (DDPMs).
Our framework employs contextual data augmentation, driven by a pretrained LLM, to generate and expand a diverse dataset of microstructure descriptors.
A retrained NER model extracts relevant microstructure descriptors from user-provided natural language inputs, which are then used by the DDPM to generate microstructures with targeted mechanical properties and topological features.
arXiv Detail & Related papers (2024-09-22T14:45:22Z)
- Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries [54.325172923155414]
We introduce Michelangelo: a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models.
This evaluation is derived via a novel, unifying framework for evaluations over arbitrarily long contexts.
arXiv Detail & Related papers (2024-09-19T10:38:01Z)
- An LLM Feature-based Framework for Dialogue Constructiveness Assessment [8.87747076871578]
Research on dialogue constructiveness assessment focuses on (i) analysing conversational factors that influence individuals to take specific actions, win debates, change their perspectives, or broaden their open-mindedness, and (ii) predicting constructiveness outcomes following dialogues for such use cases.
These objectives can be achieved by training either interpretable feature-based models or neural models such as pre-trained language models.
We propose an LLM feature-based framework for dialogue constructiveness assessment that combines the strengths of feature-based and neural approaches.
arXiv Detail & Related papers (2024-06-20T22:10:52Z)
- SymbolicAI: A framework for logic-based approaches combining generative models and solvers [9.841285581456722]
We introduce SymbolicAI, a versatile and modular framework employing a logic-based approach to concept learning and flow management in generative processes.
We treat large language models (LLMs) as semantic solvers that execute tasks based on both natural and formal language instructions.
arXiv Detail & Related papers (2024-02-01T18:50:50Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompt into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs.
Our approach achieves the new state-of-the-art on all the structured prediction tasks we looked at.
arXiv Detail & Related papers (2022-10-26T13:27:26Z)
- Edge-assisted Democratized Learning Towards Federated Analytics [67.44078999945722]
We show the hierarchical learning structure of the proposed edge-assisted democratized learning mechanism, namely Edge-DemLearn.
We also validate Edge-DemLearn as a flexible model training mechanism to build a distributed control and aggregation methodology in regions.
arXiv Detail & Related papers (2020-12-01T11:46:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all of the above) and is not responsible for any consequences of its use.