Assessing the Reliability of Large Language Models in the Bengali Legal Context: A Comparative Evaluation Using LLM-as-Judge and Legal Experts
- URL: http://arxiv.org/abs/2511.05627v1
- Date: Fri, 07 Nov 2025 02:44:00 GMT
- Title: Assessing the Reliability of Large Language Models in the Bengali Legal Context: A Comparative Evaluation Using LLM-as-Judge and Legal Experts
- Authors: Sabik Aftahee, A. F. M. Farhad, Arpita Mallik, Ratnajit Dhar, Jawadul Karim, Nahiyan Bin Noor, Ishmam Ahmed Solaiman,
- Abstract summary: Generative AI models like OpenAI GPT-4.1 Mini, Gemini 2.0 Flash, Meta Llama 3 70B, and DeepSeek R1 could potentially democratize legal assistance.<n>In this study, we collected 250 authentic legal questions from the Facebook group "Know Your Rights"<n>We evaluated each AI-generated response across four critical dimensions: factual accuracy, legal appropriateness, completeness, and clarity.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accessing legal help in Bangladesh is hard. People face high fees, complex legal language, a shortage of lawyers, and millions of unresolved court cases. Generative AI models like OpenAI GPT-4.1 Mini, Gemini 2.0 Flash, Meta Llama 3 70B, and DeepSeek R1 could potentially democratize legal assistance by providing quick and affordable legal advice. In this study, we collected 250 authentic legal questions from the Facebook group "Know Your Rights," where verified legal experts regularly provide authoritative answers. These questions were subsequently submitted to four four advanced AI models and responses were generated using a consistent, standardized prompt. A comprehensive dual evaluation framework was employed, in which a state-of-the-art LLM model served as a judge, assessing each AI-generated response across four critical dimensions: factual accuracy, legal appropriateness, completeness, and clarity. Following this, the same set of questions was evaluated by three licensed Bangladeshi legal professionals according to the same criteria. In addition, automated evaluation metrics, including BLEU scores, were applied to assess response similarity. Our findings reveal a complex landscape where AI models frequently generate high-quality, well-structured legal responses but also produce dangerous misinformation, including fabricated case citations, incorrect legal procedures, and potentially harmful advice. These results underscore the critical need for rigorous expert validation and comprehensive safeguards before AI systems can be safely deployed for legal consultation in Bangladesh.
Related papers
- LegalOne: A Family of Foundation Models for Reliable Legal Reasoning [54.57434222018289]
We present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain.<n>LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning.<n>We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI.
arXiv Detail & Related papers (2026-01-31T10:18:32Z) - PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [67.71760070255425]
We introduce PLawBench, a practical benchmark for evaluating large language models (LLMs) in legal practice scenarios.<n>PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics.<n>Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs.
arXiv Detail & Related papers (2026-01-23T11:36:10Z) - VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models [0.4310799044841232]
The Vietnamese Legal Benchmark (VLegal-Bench) is the first benchmark designed to assess large language models (LLMs) on Vietnamese legal tasks.<n>The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline.<n>By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts.
arXiv Detail & Related papers (2025-12-16T16:28:32Z) - Large Language Models' Complicit Responses to Illicit Instructions across Socio-Legal Contexts [54.15982476754607]
Large language models (LLMs) are now deployed at unprecedented scale, assisting millions of users in daily tasks.<n>This study defines complicit facilitation as the provision of guidance or support that enables illicit user instructions.<n>Using real-world legal cases and established legal frameworks, we construct an evaluation benchmark spanning 269 illicit scenarios and 50 illicit intents.
arXiv Detail & Related papers (2025-11-25T16:01:31Z) - ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation [56.79698529022327]
Legal claims refer to the plaintiff's demands in a case and are essential to guiding judicial reasoning and case resolution.<n>This paper explores the problem of legal claim generation based on the given case's facts.<n>We construct ClaimGen-CN, the first dataset for Chinese legal claim generation task.
arXiv Detail & Related papers (2025-08-24T07:19:25Z) - Legal Assist AI: Leveraging Transformer-Based Model for Effective Legal Assistance [0.18749305679160366]
Many citizens in India struggle to leverage their legal rights due to limited awareness and access to relevant legal information.<n>This paper introduces Legal Assist AI, a transformer-based model designed to bridge this gap by offering effective legal assistance through large language models (LLMs)<n>The model was evaluated against state-of-the-art models such as GPT-3.5 Turbo and Mistral 7B, achieving a 60.08% score on the AIBE.
arXiv Detail & Related papers (2025-05-28T06:06:53Z) - LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation [68.26911563519162]
Legal consultation is essential for safeguarding individual rights and ensuring access to justice.<n>Current systems fall short in handling the interactive and knowledge-intensive nature of real-world consultations.<n>LeCoDe is a real-world multi-turn benchmark dataset comprising 3,696 legal consultation dialogues with 110,008 dialogue turns.
arXiv Detail & Related papers (2025-05-26T08:24:32Z) - LEXam: Benchmarking Legal Reasoning on 340 Law Exams [76.3521146499006]
We introduce textscLEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels.<n>The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions.<n>Our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities.
arXiv Detail & Related papers (2025-05-19T08:48:12Z) - Exploring Possibilities of AI-Powered Legal Assistance in Bangladesh through Large Language Modeling [0.0]
This research seeks to develop a specialized Large Language Model (LLM) to assist in the Bangladeshi legal system.
We created UKIL-DB-EN, an English corpus of Bangladeshi legal documents, by collecting and scraping data on various legal acts.
We fine-tuned the GPT-2 model on this dataset to develop GPT2-UKIL-EN, an LLM focused on providing legal assistance in English.
arXiv Detail & Related papers (2024-10-22T17:34:59Z) - InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws.
We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries.
InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z) - Interpretable Long-Form Legal Question Answering with
Retrieval-Augmented Large Language Models [10.834755282333589]
Long-form Legal Question Answering dataset comprises 1,868 expert-annotated legal questions in the French language.
Our experimental results demonstrate promising performance on automatic evaluation metrics.
As one of the only comprehensive, expert-annotated long-form LQA dataset, LLeQA has the potential to not only accelerate research towards resolving a significant real-world issue, but also act as a rigorous benchmark for evaluating NLP models in specialized domains.
arXiv Detail & Related papers (2023-09-29T08:23:19Z) - Legal Question-Answering in the Indian Context: Efficacy, Challenges,
and Potential of Modern AI Models [3.552993426200889]
Legal QA platforms bear the promise to metamorphose the manner in which legal experts engage with jurisprudential documents.
Our discourse zeroes in on an array of retrieval and QA mechanisms, positioning the OpenAI GPT model as a reference point.
The ambit of this study is tethered to the Indian criminal legal landscape, distinguished by its intricate nature and associated logistical constraints.
arXiv Detail & Related papers (2023-09-26T07:56:55Z) - Chatlaw: A Multi-Agent Collaborative Legal Assistant with Knowledge Graph Enhanced Mixture-of-Experts Large Language Model [30.30848216845138]
Chatlaw is an innovative legal assistant utilizing a Mixture-of-Experts (MoE) model and a multi-agent system.
By integrating knowledge graphs with artificial screening, we construct a high-quality legal dataset to train the MoE model.
Our MoE model outperforms GPT-4 in the Lawbench and Unified Exam Qualification for Legal Professionals by 7.73% in accuracy and 11 points, respectively.
arXiv Detail & Related papers (2023-06-28T10:48:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.