Responsible AI for Test Equity and Quality: The Duolingo English Test as a Case Study
- URL: http://arxiv.org/abs/2409.07476v1
- Date: Wed, 28 Aug 2024 11:39:20 GMT
- Title: Responsible AI for Test Equity and Quality: The Duolingo English Test as a Case Study
- Authors: Jill Burstein, Geoffrey T. LaFlair, Kevin Yancey, Alina A. von Davier, Ravit Dotan
- Abstract summary: The chapter presents a case study using the Duolingo English Test (DET), an AI-powered, high-stakes English language assessment.
It discusses the DET RAI standards, their development and their relationship to domain-agnostic RAI principles.
It provides examples of specific RAI practices, showing how these practices meaningfully address the ethical principles of validity and reliability, fairness, privacy and security, and transparency and accountability standards.
- Score: 0.06657612504660106
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial intelligence (AI) creates opportunities for assessments, such as efficiencies for item generation and scoring of spoken and written responses. At the same time, it poses risks (such as bias in AI-generated item content). Responsible AI (RAI) practices aim to mitigate risks associated with AI. This chapter addresses the critical role of RAI practices in achieving test quality (appropriateness of test score inferences), and test equity (fairness to all test takers). To illustrate, the chapter presents a case study using the Duolingo English Test (DET), an AI-powered, high-stakes English language assessment. The chapter discusses the DET RAI standards, their development and their relationship to domain-agnostic RAI principles. Further, it provides examples of specific RAI practices, showing how these practices meaningfully address the ethical principles of validity and reliability, fairness, privacy and security, and transparency and accountability standards to ensure test equity and quality.
Related papers
- Safe and Certifiable AI Systems: Concepts, Challenges, and Lessons Learned [45.44933002008943]
This white paper presents the TÜV AUSTRIA Trusted AI framework.
It is an end-to-end audit catalog and methodology for assessing and certifying machine learning systems.
Building on three pillars - Secure Software Development, Functional Requirements, and Ethics & Data Privacy - it translates the high-level obligations of the EU AI Act into specific, testable criteria.
arXiv Detail & Related papers (2025-09-08T17:52:08Z)
- CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection [60.52240468810558]
We introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews.
We also develop CoCoDet, an AI review detector via a multi-task learning framework, to achieve more accurate and robust detection of AI involvement in review content.
arXiv Detail & Related papers (2025-08-28T06:03:11Z)
- INSEva: A Comprehensive Chinese Benchmark for Large Language Models in Insurance [48.22571187209047]
INSEva is a Chinese benchmark specifically designed for evaluating AI systems' knowledge and capabilities in insurance.
INSEva features a multi-dimensional evaluation taxonomy covering business areas, task formats, difficulty levels, and the cognitive-knowledge dimension.
Our benchmark implements tailored evaluation methods for assessing both faithfulness and completeness in open-ended responses.
arXiv Detail & Related papers (2025-08-27T03:13:40Z)
- Exploring AI-Enabled Test Practice, Affect, and Test Outcomes in Language Assessment [0.0]
Generative AI-driven, automated item generation (AIG) scales the creation of large item banks and multiple practice tests.
This study is the first large-scale study exploring the use of AIG-enabled practice tests in high-stakes language assessment.
arXiv Detail & Related papers (2025-08-23T18:41:30Z)
- Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.69724201080155]
We show that many agentic benchmarks have issues in task setup or reward design.
Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms.
We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience.
arXiv Detail & Related papers (2025-07-03T17:35:31Z)
- The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority.
We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting area chairs (ACs) in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
- Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning [53.92712851223158]
We formulate safety and privacy issues into contextualized compliance problems following Contextual Integrity (CI) theory.
Under the CI framework, we align our model with three critical regulatory standards, including the EU AI Act and HIPAA.
We employ reinforcement learning (RL) with a rule-based reward to incentivize contextual reasoning capabilities while enhancing compliance with safety and privacy norms.
arXiv Detail & Related papers (2025-05-20T16:40:09Z)
- ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification [53.80183105328448]
Refine via Intrinsic Self-Verification (ReVISE) is an efficient framework that enables LLMs to self-correct their outputs through self-verification.
Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.
arXiv Detail & Related papers (2025-02-20T13:50:02Z)
- Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation [2.2241228857601727]
This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices.
It brings together many fine-grained issues in the design and application of benchmarks with broader sociotechnical issues.
Our review also highlights a series of systemic flaws in current practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results.
arXiv Detail & Related papers (2025-02-10T15:25:06Z)
- Who is Responsible? The Data, Models, Users or Regulations? A Comprehensive Survey on Responsible Generative AI for a Sustainable Future [7.976680307696195]
Responsible Artificial Intelligence (RAI) has emerged as a crucial framework for addressing ethical concerns in the development and deployment of AI systems.
This article examines the challenges and opportunities in implementing ethical, transparent, and accountable AI systems in the post-ChatGPT era.
arXiv Detail & Related papers (2025-01-15T20:59:42Z)
- The Fundamental Rights Impact Assessment (FRIA) in the AI Act: Roots, legal obligations and key elements for a model template [55.2480439325792]
This article aims to fill existing gaps in the theoretical and methodological elaboration of the Fundamental Rights Impact Assessment (FRIA).
This article outlines the main building blocks of a model template for the FRIA.
It can serve as a blueprint for other national and international regulatory initiatives to ensure that AI is fully consistent with human rights.
arXiv Detail & Related papers (2024-11-07T11:55:55Z)
- Where Assessment Validation and Responsible AI Meet [0.0876953078294908]
We propose a unified assessment framework that considers classical test validation theory and assessment-specific and domain-agnostic RAI principles and practice.
The framework addresses responsible AI use for assessment that supports validity arguments, alignment with AI ethics to maintain human values and oversight, and broader social responsibility associated with AI use.
arXiv Detail & Related papers (2024-11-04T20:20:29Z)
- Responsible AI Question Bank: A Comprehensive Tool for AI Risk Assessment [17.026921603767722]
The study introduces our Responsible AI (RAI) Question Bank, a comprehensive framework and tool designed to support diverse AI initiatives.
By integrating AI ethics principles such as fairness, transparency, and accountability into a structured question format, the RAI Question Bank aids in identifying potential risks.
arXiv Detail & Related papers (2024-08-02T22:40:20Z)
- Argument Quality Assessment in the Age of Instruction-Following Large Language Models [45.832808321166844]
A critical task in any application that processes arguments is the assessment of an argument's quality.
We identify the diversity of quality notions and the subjectiveness of their perception as the main hurdles towards substantial progress on argument quality assessment.
We argue that the capabilities of instruction-following large language models (LLMs) to leverage knowledge across contexts enable a much more reliable assessment.
arXiv Detail & Related papers (2024-03-24T10:43:21Z)
- Quality Assurance for Artificial Intelligence: A Study of Industrial Concerns, Challenges and Best Practices [14.222404866137756]
We report on the challenges and best practices of quality assurance for AI systems (QA4AI).
Our findings suggest correctness as the most important property, followed by model relevance, efficiency and deployability.
We identify 21 QA4AI practices across each stage of AI development.
arXiv Detail & Related papers (2024-02-26T08:31:45Z)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
- Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents [80.5213198675411]
Large language models (LLMs) have dramatically enhanced the field of language intelligence.
LLMs leverage the intriguing chain-of-thought (CoT) reasoning technique, which obliges them to formulate intermediate steps en route to deriving an answer.
Recent research endeavors have extended CoT reasoning methodologies to nurture the development of autonomous language agents.
arXiv Detail & Related papers (2023-11-20T14:30:55Z)
- Test-takers have a say: understanding the implications of the use of AI in language tests [12.430886405811757]
This study is the first empirical study aimed at identifying the implications of AI adoption in language tests from a test-taker perspective.
We identify that AI integration may enhance perceptions of fairness, consistency, and availability.
However, it might incite mistrust regarding reliability and interactivity, subsequently influencing test-takers' behaviors and well-being.
arXiv Detail & Related papers (2023-07-19T10:28:59Z)
- Benchmarking Foundation Models with Language-Model-as-an-Examiner [47.345760054595246]
We propose a novel benchmarking framework, Language-Model-as-an-Examiner.
The LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner.
arXiv Detail & Related papers (2023-06-07T06:29:58Z)
- Context-faithful Prompting for Large Language Models [51.194410884263135]
Large language models (LLMs) encode parametric knowledge about world facts.
Their reliance on parametric knowledge may cause them to overlook contextual cues, leading to incorrect predictions in context-sensitive NLP tasks.
We assess and enhance LLMs' contextual faithfulness in two aspects: knowledge conflict and prediction with abstention.
arXiv Detail & Related papers (2023-03-20T17:54:58Z)
- An interdisciplinary conceptual study of Artificial Intelligence (AI) for helping benefit-risk assessment practices: Towards a comprehensive qualification matrix of AI programs and devices (pre-print 2020) [55.41644538483948]
This paper proposes a comprehensive analysis of existing concepts coming from different disciplines tackling the notion of intelligence.
The aim is to identify shared notions or discrepancies to consider for qualifying AI systems.
arXiv Detail & Related papers (2021-05-07T12:01:31Z)
- Multisource AI Scorecard Table for System Evaluation [3.74397577716445]
The paper describes a Multisource AI Scorecard Table (MAST) that provides the developer and user of an artificial intelligence (AI)/machine learning (ML) system with a standard checklist.
The paper explores how the analytic tradecraft standards outlined in Intelligence Community Directive (ICD) 203 can provide a framework for assessing the performance of an AI system.
arXiv Detail & Related papers (2021-02-08T03:37:40Z)
- Trustworthy AI [75.99046162669997]
Brittleness to minor adversarial changes in the input data, limited ability to explain decisions, and bias in training data are some of the most prominent limitations of current AI systems.
We propose the tutorial on Trustworthy AI to address six critical issues in enhancing user and public trust in AI systems.
arXiv Detail & Related papers (2020-11-02T20:04:18Z)