Wisdom of collaborators: a peer-review approach to performance appraisal
- URL: http://arxiv.org/abs/1912.12861v1
- Date: Mon, 30 Dec 2019 09:23:51 GMT
- Title: Wisdom of collaborators: a peer-review approach to performance appraisal
- Authors: Sofia Dokuka, Ivan Zaikin, Kate Furman, Maksim Tsvetovat and Alex
Furman
- Abstract summary: We propose a novel metric, the Peer Rank Score (PRS), that evaluates individual reputations and the non-quantifiable individual impact.
PRS is based on pairwise comparisons of employees.
We show that the algorithm is highly robust in simulations and empirically validate it at a genetic testing company on more than one thousand employees.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Individual performance and reputation within a company are major factors that
influence wage distribution, promotion and firing. Due to the complexity and
collaborative nature of contemporary business processes, the evaluation of
individual impact in the majority of organizations is an ambiguous and
non-trivial task. Existing performance appraisal approaches are often affected
by individuals' biased judgements, and organizations are dissatisfied with the
results of evaluations. We assert that employees can provide accurate
measurement of their peer performance in a complex collaborative environment.
We propose a novel metric, the Peer Rank Score (PRS), that evaluates individual
reputations and the non-quantifiable individual impact. PRS is based on
pairwise comparisons of employees. We show that the algorithm is highly robust
in simulations and empirically validate it at a genetic testing company on more
than one thousand employees, using peer reviews collected over three years.
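The abstract describes PRS only as being based on pairwise comparisons of employees, without specifying the aggregation rule. As an illustrative stand-in (not the authors' actual PRS algorithm), a minimal sketch of turning pairwise outcomes into reputation scores using the Bradley-Terry model with iterative MM updates:

```python
from collections import defaultdict

def peer_rank_scores(comparisons, n_iter=100, eps=1e-9):
    """Score individuals from pairwise comparisons via the
    Bradley-Terry model (iterative minorize-maximize updates).
    `comparisons` is a list of (winner, loser) pairs. This is an
    illustrative sketch only; the actual PRS metric is not
    specified in the abstract.
    """
    wins = defaultdict(float)   # wins[p] = comparisons p won
    pairs = defaultdict(float)  # pairs[(a, b)] = times a, b compared
    people = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pairs[tuple(sorted((winner, loser)))] += 1.0
        people.update((winner, loser))

    score = {p: 1.0 for p in people}
    for _ in range(n_iter):
        new = {}
        for p in people:
            # Sum over every pairing involving p.
            denom = 0.0
            for (a, b), n in pairs.items():
                if p in (a, b):
                    other = b if p == a else a
                    denom += n / (score[p] + score[other])
            new[p] = wins[p] / (denom + eps)
        # Normalize so scores sum to the number of people.
        total = sum(new.values())
        score = {p: v * len(people) / total for p, v in new.items()}
    return score
```

On a toy input where alice beats bob and carol, and bob beats carol, the scores recover the ordering alice > bob > carol.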
Related papers
- (De)Noise: Moderating the Inconsistency Between Human Decision-Makers [15.291993233528526]
We study whether algorithmic decision aids can be used to moderate the degree of inconsistency in human decision-making in the context of real estate appraisal.
We find that both (i) asking respondents to review their estimates in a series of algorithmically chosen pairwise comparisons and (ii) providing respondents with traditional machine advice are effective strategies for influencing human responses.
arXiv Detail & Related papers (2024-07-15T20:24:36Z) - 360$^\circ$REA: Towards A Reusable Experience Accumulation with 360° Assessment for Multi-Agent System [71.96888731208838]
We argue that a comprehensive evaluation and accumulating experience from evaluation feedback is an effective approach to improving system performance.
We propose Reusable Experience Accumulation with 360$^\circ$ Assessment (360$^\circ$REA), a hierarchical multi-agent framework inspired by corporate organizational practices.
arXiv Detail & Related papers (2024-04-08T14:43:13Z) - Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities in assessing the quality of generated natural language.
LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments.
We introduce Pairwise-preference Search (PairS), an uncertainty-guided search method that employs LLMs to conduct pairwise comparisons and efficiently ranks candidate texts.
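The PairS summary above mentions ranking candidate texts via pairwise comparisons. A minimal sketch of the underlying idea, with a pluggable judge function standing in for the LLM comparator (the uncertainty-guided search that PairS adds on top is not reproduced here):

```python
from functools import cmp_to_key

def rank_candidates(texts, prefer):
    """Rank candidate texts from best to worst using only pairwise
    comparisons. `prefer(a, b)` is a pluggable judge (in PairS, an
    LLM prompt) returning True if `a` is preferred over `b`.
    Comparison-based sorting needs O(n log n) judge calls rather
    than all n*(n-1)/2 pairs. Illustrative sketch, not the PairS
    algorithm itself.
    """
    def cmp(a, b):
        return -1 if prefer(a, b) else 1
    return sorted(texts, key=cmp_to_key(cmp))

# Hypothetical judge for demonstration: prefer the longer text.
longer = lambda a, b: len(a) > len(b)
```

With the hypothetical length-based judge, `rank_candidates(["bb", "a", "ccc"], longer)` orders the texts from longest to shortest.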
arXiv Detail & Related papers (2024-03-25T17:11:28Z) - Individualized Policy Evaluation and Learning under Clustered Network
Interference [4.560284382063488]
We consider the problem of evaluating and learning an optimal individualized treatment rule under clustered network interference.
We propose an estimator that can be used to evaluate the empirical performance of an ITR.
We derive the finite-sample regret bound for a learned ITR, showing that the use of our efficient evaluation estimator leads to the improved performance of learned policies.
arXiv Detail & Related papers (2023-11-04T17:58:24Z) - Collaborative Evaluation: Exploring the Synergy of Large Language Models
and Humans for Open-ended Generation Evaluation [71.76872586182981]
Large language models (LLMs) have emerged as a scalable and cost-effective alternative to human evaluations.
We propose a Collaborative Evaluation pipeline CoEval, involving the design of a checklist of task-specific criteria and the detailed evaluation of texts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z) - ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z) - Measuring the Effect of Influential Messages on Varying Personas [67.1149173905004]
We present a new task, Response Forecasting on Personas for News Media, to estimate the response a persona might have upon seeing a news message.
The proposed task not only introduces personalization in the modeling but also predicts the sentiment polarity and intensity of each response.
This enables more accurate and comprehensive inference on the mental state of the persona.
arXiv Detail & Related papers (2023-05-25T21:01:00Z) - Homophily and Incentive Effects in Use of Algorithms [17.55279695774825]
We present a crowdsourcing vignette study designed to assess the impacts of two plausible factors on AI-informed decision-making.
First, we examine homophily -- do people defer more to models that tend to agree with them?
Second, we consider incentives -- how do people incorporate a (known) cost structure in the hybrid decision-making setting?
arXiv Detail & Related papers (2022-05-19T17:11:04Z) - Improving Peer Assessment with Graph Convolutional Networks [2.105564340986074]
Peer assessment might not be as accurate as expert evaluation, which can render these systems unreliable.
We first model peer assessment as multi-relational weighted networks that can express a variety of peer assessment setups.
We introduce a graph convolutional network which can learn assessment patterns and user behaviors to more accurately predict expert evaluations.
arXiv Detail & Related papers (2021-11-04T03:43:09Z) - Catch Me if I Can: Detecting Strategic Behaviour in Peer Assessment [61.24399136715106]
We consider the issue of strategic behaviour in various peer-assessment tasks, including peer grading of exams or homeworks and peer review in hiring or promotions.
Our focus is on designing methods for detection of such manipulations.
Specifically, we consider a setting in which agents evaluate a subset of their peers and output rankings that are later aggregated to form a final ordering.
arXiv Detail & Related papers (2020-10-08T15:08:40Z) - The cost of coordination can exceed the benefit of collaboration in
performing complex tasks [0.0]
Dyads gradually improve in performance but do not experience a collective benefit over individuals in most situations.
Having an additional expert in the dyad who is adequately trained improves accuracy.
Findings highlight that the extent of training received by an individual, the complexity of the task at hand, and the desired performance indicator are all critical factors that need to be accounted for when weighing up the benefits of collective decision-making.
arXiv Detail & Related papers (2020-09-23T10:18:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.