A Proposed Framework for the Comprehensive Scalability Assessment of
ICTD Projects
- URL: http://arxiv.org/abs/2108.09756v1
- Date: Sun, 22 Aug 2021 15:29:53 GMT
- Title: A Proposed Framework for the Comprehensive Scalability Assessment of
ICTD Projects
- Authors: Gugulethu Baduza and Caroline Khene
- Abstract summary: Scalability of ICTD projects is an imperative topic that has been neglected in the field.
This study proposes a Comprehensive Scalability Assessment Framework (CSAF), using systems theory and amplification theory.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The scalability of ICTD projects is an imperative topic that has been
neglected in the field. Little has been written or investigated about the
assessment of the scalability of ICTD projects due to factors such as the lack
of proven business models for success, the high failure rate of projects,
undefined aspects of assessment, and the small number of projects that have
scaled. Therefore, there are various factors that should be taken into
consideration to alleviate the challenges experienced in the process of scaling
up. This research study is guided by an investigation into how the scalability of an ICTD project can be assessed using a comprehensive evaluation
approach that considers the impact and potential sustainability of the project.
This research study proposes a Comprehensive Scalability Assessment Framework
(CSAF), using systems theory and amplification theory to guide the theoretical
analysis and empirical investigation. A theorizing approach is used to develop
the framework, which is structured around three components: assessment
guidelines and proceeding domains of evaluation; four scalability themes (stakeholder composition, models feasibility, resources sustainability, and resilience); and judging scalability.
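The abstract does not specify an operational scoring scheme for the CSAF. As a rough, purely illustrative sketch in Python (all names, scales, and the threshold below are assumptions, not part of the framework), the four scalability themes could be recorded and rolled up into a coarse judgement as follows:

```python
from dataclasses import dataclass, field

# Hypothetical illustration only: the CSAF does not prescribe this data
# structure or any numeric scoring scheme; names and scales are invented.

THEMES = (
    "stakeholder_composition",
    "models_feasibility",
    "resources_sustainability",
    "resilience",
)

@dataclass
class ScalabilityAssessment:
    project: str
    # Per-theme evidence scores on an assumed 0-1 scale.
    theme_scores: dict = field(default_factory=dict)

    def judge(self, threshold: float = 0.6) -> str:
        """Roll the four theme scores into a single, coarse judgement."""
        missing = [t for t in THEMES if t not in self.theme_scores]
        if missing:
            return f"incomplete assessment: missing {missing}"
        mean = sum(self.theme_scores[t] for t in THEMES) / len(THEMES)
        return "ready to scale" if mean >= threshold else "not yet ready"

assessment = ScalabilityAssessment(
    project="example ICTD project",
    theme_scores={
        "stakeholder_composition": 0.7,
        "models_feasibility": 0.5,
        "resources_sustainability": 0.6,
        "resilience": 0.8,
    },
)
print(assessment.judge())  # -> "ready to scale"
```

In the framework itself, judging scalability is its own component rather than a numeric threshold; the sketch is only meant to show the four themes feeding a single judgement.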
Related papers
- InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem [87.30601926271864]
InnoEval is a deep innovation evaluation framework designed to emulate human-level idea assessment.
We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources.
We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval.
arXiv Detail & Related papers (2026-02-16T00:40:31Z) - Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development [33.01897134024342]
Development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering.
This work provides the community with a more realistic benchmark, a comprehensive evaluation framework, and crucial insights into the current capabilities and core challenges of software development agents.
arXiv Detail & Related papers (2025-11-06T05:10:04Z) - CDT: A Comprehensive Capability Framework for Large Language Models Across Cognition, Domain, and Task [49.27354010985993]
Recent advances in Large Language Models (LLMs) have significantly enhanced their capabilities.
Existing benchmarks often focus on isolated abilities, lacking a holistic framework for assessing LLM capabilities.
We propose the Cognition-Domain-Task (CDT) framework, which comprehensively measures a model's capabilities across three dimensions.
arXiv Detail & Related papers (2025-09-29T08:10:29Z) - The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs [57.1838332916627]
Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP).
Their widespread deployment has also raised significant safety concerns.
LLM-generated content can exhibit unsafe behaviors such as toxicity, bias, or misinformation, especially in adversarial contexts.
arXiv Detail & Related papers (2025-06-06T05:50:50Z) - Towards an intelligent assessment system for evaluating the development of algorithmic thinking skills: An exploratory study in Swiss compulsory schools [0.0]
This study aims to develop a comprehensive framework for large-scale assessment of CT skills, particularly focusing on AT, the ability to design algorithms.
We first developed a competence model capturing the situated and developmental nature of CT, guiding the design of activities tailored to cognitive abilities, age, and context.
We developed an activity for large-scale assessment of AT skills, offered in two variants: one based on non-digital artefacts (unplugged) and manual expert assessment, and the other based on digital artefacts (virtual) and automatic assessment.
arXiv Detail & Related papers (2025-03-27T13:34:36Z) - An Overview of Large Language Models for Statisticians [109.38601458831545]
Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence (AI).
This paper explores potential areas where statisticians can make important contributions to the development of LLMs.
We focus on issues such as uncertainty quantification, interpretability, fairness, privacy, watermarking and model adaptation.
arXiv Detail & Related papers (2025-02-25T03:40:36Z) - Evaluating Step-by-step Reasoning Traces: A Survey [3.895864050325129]
We propose a taxonomy of evaluation criteria with four top-level categories (groundedness, validity, coherence, and utility).
We then categorize metrics based on their implementations, survey which metrics are used for assessing each criterion, and explore whether evaluator models can transfer across different criteria.
arXiv Detail & Related papers (2025-02-17T19:58:31Z) - The Science of Evaluating Foundation Models [46.973855710909746]
This work focuses on three key aspects: (1) Formalizing the Evaluation Process by providing a structured framework tailored to specific use-case contexts; (2) Offering Actionable Tools and Frameworks such as checklists and templates to ensure thorough, reproducible, and practical evaluations; and (3) Surveying Recent Work with a targeted review of advancements in LLM evaluation, emphasizing real-world applications.
arXiv Detail & Related papers (2025-02-12T22:55:43Z) - Using Sustainability Impact Scores for Software Architecture Evaluation [5.33605239628904]
We present an improved version of the Sustainability Impact Score (SIS).
The SIS facilitates the identification and quantification of trade-offs in terms of their sustainability impact.
Our study reveals that technical quality concerns have significant, often unrecognized impacts across sustainability dimensions.
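The summary gives no formula for the SIS, so the following Python sketch is only a generic illustration of quantifying trade-offs across sustainability dimensions; the dimension names and the -2 to +2 scale are assumptions, not the paper's actual scoring scheme.

```python
# Hypothetical illustration of scoring a design decision across sustainability
# dimensions; this is NOT the paper's actual SIS formula.

DIMENSIONS = ("technical", "economic", "social", "environmental")

def impact_score(impacts: dict) -> int:
    """Sum per-dimension impacts rated on an assumed -2 (harmful) to +2 (beneficial) scale."""
    return sum(impacts.get(d, 0) for d in DIMENSIONS)

decision = {"technical": +2, "economic": +1, "social": 0, "environmental": -2}
print(impact_score(decision))  # 1: net positive overall, but hides an environmental trade-off
```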
arXiv Detail & Related papers (2025-01-28T15:00:45Z) - StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs.
Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets.
We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
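As a flavour of what a deterministic, rule-based check of a compositional structured output can look like (a generic Python illustration, not StructTest's own rules, tasks, or evaluator):

```python
import json

# Illustrative only: a minimal deterministic check that a model response
# follows a simple compositional instruction (valid JSON, required keys,
# minimum number of steps). The rule itself is made up for this example.

def check_output(response: str, required_keys=("summary", "steps"), min_steps=3) -> bool:
    """Return True iff the response is valid JSON with the required keys
    and at least `min_steps` entries in its 'steps' list."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not all(k in data for k in required_keys):
        return False
    return isinstance(data.get("steps"), list) and len(data["steps"]) >= min_steps

good = '{"summary": "ok", "steps": ["a", "b", "c"]}'
bad = '{"summary": "ok", "steps": ["a"]}'
print(check_output(good), check_output(bad))  # True False
```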
arXiv Detail & Related papers (2024-12-23T22:08:40Z) - A Survey of Event Causality Identification: Principles, Taxonomy, Challenges, and Assessment [6.492836595169771]
Event Causality Identification (ECI) has become a crucial task in Natural Language Processing (NLP).
Our taxonomy classifies ECI methods according to the two primary tasks of sentence-level (SECI) and document-level (DECI) event causality identification.
arXiv Detail & Related papers (2024-11-15T17:19:42Z) - A Comprehensive Survey on Evidential Deep Learning and Its Applications [64.83473301188138]
Evidential Deep Learning (EDL) provides reliable uncertainty estimation with minimal additional computation in a single forward pass.
We first delve into the theoretical foundation of EDL, the subjective logic theory, and discuss its distinctions from other uncertainty estimation frameworks.
We elaborate on its extensive applications across various machine learning paradigms and downstream tasks.
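For orientation, the standard Dirichlet-based readout that gives EDL its single-forward-pass uncertainty (the usual subjective-logic formulation, not code from the survey itself; the logits are made up) looks roughly like this:

```python
import numpy as np

# Standard evidential deep learning readout for K-class classification:
# non-negative evidence -> Dirichlet concentrations -> probabilities and
# an uncertainty (vacuity) mass, all from one forward pass.

def edl_readout(logits: np.ndarray):
    evidence = np.maximum(logits, 0.0)      # non-negative evidence, e.g. via ReLU
    alpha = evidence + 1.0                  # Dirichlet concentration parameters
    strength = alpha.sum()
    probs = alpha / strength                # expected class probabilities
    uncertainty = len(alpha) / strength     # uncertainty (vacuity) mass
    return probs, uncertainty

probs, u = edl_readout(np.array([4.0, 1.0, 0.0]))
print(probs.round(3), round(u, 3))  # [0.625 0.25 0.125] 0.375
```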
arXiv Detail & Related papers (2024-09-07T05:55:06Z) - StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation [46.59416831869014]
We propose a novel evaluation framework referred to as StructEval.
Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts.
Experiments on three widely-used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination.
arXiv Detail & Related papers (2024-08-06T16:28:30Z) - Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks [3.773596042872403]
As large language models (LLMs) continue to evolve, the need for robust and standardized evaluation benchmarks becomes paramount.
Various frameworks have emerged as noteworthy contributions to the field, offering comprehensive evaluation tests and benchmarks.
This paper provides an exploration and critical analysis of some of these evaluation methodologies, shedding light on their strengths, limitations, and impact on advancing the state-of-the-art in natural language processing.
arXiv Detail & Related papers (2024-07-29T03:37:14Z) - Evaluating Human-AI Collaboration: A Review and Methodological Framework [4.41358655687435]
The use of artificial intelligence (AI) in working environments with individuals, known as Human-AI Collaboration (HAIC), has become essential.
Evaluating HAIC's effectiveness remains challenging due to the complex interaction of components involved.
This paper provides a detailed analysis of existing HAIC evaluation approaches and develops a fresh paradigm for more effectively evaluating these systems.
arXiv Detail & Related papers (2024-07-09T12:52:22Z) - Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition [70.60872754129832]
The first NeurIPS competition on unlearning sought to stimulate the development of novel algorithms.
Nearly 1,200 teams from across the world participated.
We analyze top solutions and delve into discussions on benchmarking unlearning.
arXiv Detail & Related papers (2024-06-13T12:58:00Z) - ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z) - A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [51.26815896167173]
We present a comprehensive tertiary analysis of PAMI reviews along three complementary dimensions.
Our analyses reveal distinctive organizational patterns as well as persistent gaps in current review practices.
Finally, our evaluation of state-of-the-art AI-generated reviews indicates encouraging advances in coherence and organization.
arXiv Detail & Related papers (2024-02-20T11:28:50Z) - Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [61.98556945939045]
We propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories.
Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework.
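The summary does not reproduce the objective; for reference, the standard DPO loss on one chosen/rejected trajectory pair (the generic method the framework builds on, not the paper's full pipeline; the log-probabilities are made up) is sketched below:

```python
import math

# Standard DPO loss on one preference pair (chosen vs. rejected trajectory).
# Inputs are summed log-probabilities of each sequence under the policy being
# trained and under the frozen reference policy.

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid(beta * margin)

print(round(dpo_loss(-12.0, -15.0, -13.0, -14.0), 4))  # ~0.598
```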
arXiv Detail & Related papers (2024-02-01T15:18:33Z) - Evaluating General-Purpose AI with Psychometrics [43.85432514910491]
We discuss the need for a comprehensive and accurate evaluation of general-purpose AI systems such as large language models.
Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems.
To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation.
arXiv Detail & Related papers (2023-10-25T05:38:38Z) - Geometric Deep Learning for Structure-Based Drug Design: A Survey [83.87489798671155]
Structure-based drug design (SBDD) leverages the three-dimensional geometry of proteins to identify potential drug candidates.
Recent advancements in geometric deep learning, which effectively integrate and process 3D geometric data, have significantly propelled the field forward.
arXiv Detail & Related papers (2023-06-20T14:21:58Z) - Rethinking Model Evaluation as Narrowing the Socio-Technical Gap [47.632123167141245]
We argue that model evaluation practices must take on a critical task to cope with the challenges and responsibilities brought by this homogenization.
We urge the community to develop evaluation methods based on real-world contexts and human requirements.
arXiv Detail & Related papers (2023-06-01T00:01:43Z)