LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation
- URL: http://arxiv.org/abs/2505.19667v1
- Date: Mon, 26 May 2025 08:24:32 GMT
- Title: LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation
- Authors: Weikang Yuan, Kaisong Song, Zhuoren Jiang, Junjie Cao, Yujie Zhang, Jun Lin, Kun Kuang, Ji Zhang, Xiaozhong Liu
- Abstract summary: Legal consultation is essential for safeguarding individual rights and ensuring access to justice. Current systems fall short in handling the interactive and knowledge-intensive nature of real-world consultations. LeCoDe is a real-world multi-turn benchmark dataset comprising 3,696 legal consultation dialogues with 110,008 dialogue turns.
- Score: 42.52284832752026
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Legal consultation is essential for safeguarding individual rights and ensuring access to justice, yet it remains costly and inaccessible to many individuals due to the shortage of professionals. While recent advances in Large Language Models (LLMs) offer a promising path toward scalable, low-cost legal assistance, current systems fall short in handling the interactive and knowledge-intensive nature of real-world consultations. To address these challenges, we introduce LeCoDe, a real-world multi-turn benchmark dataset comprising 3,696 legal consultation dialogues with 110,008 dialogue turns, designed to evaluate and improve LLMs' legal consultation capability. To build LeCoDe, we collect live-streamed consultations from short-video platforms, providing authentic multi-turn legal consultation dialogues; rigorous annotation by legal experts further enriches the dataset with professional insights and expertise. Furthermore, we propose a comprehensive evaluation framework that assesses LLMs' consultation capabilities in terms of (1) clarification capability and (2) professional advice quality. This unified framework incorporates 12 metrics across these two dimensions. Through extensive experiments on various general and domain-specific LLMs, our results reveal significant challenges in this task: even state-of-the-art models like GPT-4 achieve only 39.8% recall for clarification and a 59% overall score for advice quality, highlighting the complexity of professional consultation scenarios. Based on these findings, we further explore several strategies to enhance LLMs' legal consultation abilities. Our benchmark contributes to advancing research in legal-domain dialogue systems, particularly in simulating more realistic user-expert interactions.
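The clarification dimension above is reported as recall against expert-annotated clarification questions, which is where the 39.8% figure for GPT-4 comes from. Below is a minimal sketch of how such a recall metric could be computed; the data layout and the word-overlap matching rule are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: LeCoDe's actual metric definitions may differ.
# Clarification recall = fraction of expert-annotated clarification
# questions that the evaluated model also asked.

def clarification_recall(gold_questions, model_questions, matches):
    """Recall of expert clarification questions under a matching rule."""
    if not gold_questions:
        return 1.0  # nothing to recall
    hits = sum(
        1 for g in gold_questions
        if any(matches(g, m) for m in model_questions)
    )
    return hits / len(gold_questions)

def jaccard_match(a, b, threshold=0.5):
    """Toy matching rule: word-overlap (Jaccard similarity) above a threshold."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) >= threshold

gold = ["Was the agreement put in writing?", "When was the contract signed?"]
pred = ["Was the agreement in writing?", "How much was the deposit?"]
print(clarification_recall(gold, pred, jaccard_match))  # 0.5 with this toy rule
```

In practice the matching criterion would more plausibly be semantic (embedding- or LLM-based) rather than lexical, since a model can ask the same clarifying question in very different words.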
Related papers
- Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback [51.26493826461026]
We propose Ψ-Arena, an interactive framework for the comprehensive assessment and optimization of LLM-based psychological counselors. Ψ-Arena features realistic arena interactions that simulate real-world counseling through multi-stage dialogues with psychologically profiled NPC clients. Experiments across eight state-of-the-art LLMs show significant performance variations across real-world scenarios and evaluation perspectives.
arXiv Detail & Related papers (2025-05-06T08:22:51Z)
- LegalAgentBench: Evaluating LLM Agents in Legal Domain [53.70993264644004]
LegalAgentBench is a benchmark specifically designed to evaluate LLM agents in the Chinese legal domain. It includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge.
arXiv Detail & Related papers (2024-12-23T04:02:46Z)
- Optimizing Numerical Estimation and Operational Efficiency in the Legal Domain through Large Language Models
We propose a novel approach integrating Large Language Models with specially designed prompts to address precision requirements in legal Artificial Intelligence (LegalAI) applications.
To validate this method, we introduce a curated dataset tailored to precision-oriented LegalAI tasks.
arXiv Detail & Related papers (2024-07-26T18:46:39Z)
- LawLuo: A Multi-Agent Collaborative Framework for Multi-Round Chinese Legal Consultation [1.9857357818932064]
LawLuo is a multi-agent framework for multi-turn Chinese legal consultations. It comprises four agents, including a receptionist agent, which assesses user intent and selects a lawyer agent; a lawyer agent, which interacts with the user; and a secretary agent, which organizes conversation records and generates consultation reports. These agents' interactions mimic the operations of real law firms.
arXiv Detail & Related papers (2024-07-23T07:40:41Z)
- InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws.
We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries.
InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z)
- Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval [16.29803062332164]
We propose a few-shot approach in which large language models assist in generating expert-aligned relevance judgments. The approach decomposes the judgment process into several stages, mimicking the workflow of human annotators, and ensures interpretable data labeling, providing transparency and clarity in the relevance assessment process.
arXiv Detail & Related papers (2024-03-27T09:46:56Z)
- ProSwitch: Knowledge-Guided Instruction Tuning to Switch Between Professional and Non-Professional Responses [56.949741308866535]
Large Language Models (LLMs) have demonstrated efficacy in various linguistic applications. This study introduces ProSwitch, a novel approach that enables a language model to switch between professional and non-professional answers.
arXiv Detail & Related papers (2024-03-14T06:49:16Z)
- MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues [58.33076950775072]
MT-Bench-101 is designed to evaluate the fine-grained abilities of Large Language Models (LLMs) in multi-turn dialogues.
We construct a three-tier hierarchical ability taxonomy comprising 4,208 turns across 1,388 multi-turn dialogues in 13 distinct tasks.
We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives.
arXiv Detail & Related papers (2024-02-22T18:21:59Z)
- Are LLMs Effective Negotiators? Systematic Evaluation of the Multifaceted Capabilities of LLMs in Negotiation Dialogues [4.738985706520995]
This work aims to systematically analyze the multifaceted capabilities of LLMs across diverse dialogue scenarios.
Our analysis highlights GPT-4's superior performance in many tasks while identifying specific challenges.
arXiv Detail & Related papers (2024-02-21T06:11:03Z)
- (A)I Am Not a Lawyer, But...: Engaging Legal Experts towards Responsible LLM Policies for Legal Advice [8.48013392781081]
Large language models (LLMs) are increasingly capable of providing users with advice in a wide range of professional domains, including legal advice.
We conducted workshops with 20 legal experts using methods inspired by case-based reasoning.
Our findings reveal novel legal considerations, such as unauthorized practice of law, confidentiality, and liability for inaccurate advice.
arXiv Detail & Related papers (2024-02-02T19:35:34Z)
- A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction [60.70089334782383]
Large language models (LLMs) have demonstrated great potential for domain-specific applications.
Recent disputes over GPT-4's law evaluation raise questions about LLMs' performance in real-world legal tasks.
We design practical baseline solutions based on LLMs and test them on the task of legal judgment prediction.
arXiv Detail & Related papers (2023-10-18T07:38:04Z)