Measuring Plagiarism in Introductory Programming Course Assignments
- URL: http://arxiv.org/abs/2205.08520v1
- Date: Fri, 29 Apr 2022 17:06:26 GMT
- Title: Measuring Plagiarism in Introductory Programming Course Assignments
- Authors: Muhammad Humayoun, Muhammad Adnan Hashmi and Ali Hanzala Khan
- Abstract summary: This paper discusses the methods of plagiarism and its detection in introductory programming course assignments written in C++.
A general framework is developed that uses three token-based similarity methods as features and predicts whether a solution is plagiarized.
We achieved F1 scores of 0.955 and 0.971 on the original and synthetic datasets, respectively.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Measuring plagiarism in programming assignments is an essential task in the
educational process. This paper discusses methods of plagiarism and its
detection in introductory programming course assignments written in C++. A
small corpus of assignments is made publicly available. A general framework
to compute the similarity between a solution pair is developed that uses
three token-based similarity methods as features and predicts whether the
solution is plagiarized. The importance of each feature is also measured,
which in turn ranks the effectiveness of each method in use. Finally, the
artificially generated dataset improves the results compared to the original
data. We achieved F1 scores of 0.955 and 0.971 on the original and synthetic
datasets, respectively.
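
As a rough illustration of the pipeline the abstract describes, here is a minimal sketch: compute three token-based similarity scores per solution pair, feed them to a classifier, and read per-method effectiveness off the feature importances. The abstract does not name the paper's three methods, so the Jaccard, containment, and sequence-ratio measures below (and the random-forest choice) are assumptions, not the authors' exact implementation.

```python
# Minimal sketch: token-based similarity features -> plagiarism classifier.
# The three measures (Jaccard, containment, sequence ratio) are assumptions;
# the abstract only says "three token-based similarity methods".
from difflib import SequenceMatcher

from sklearn.ensemble import RandomForestClassifier


def tokenize(code: str) -> list[str]:
    # Placeholder tokenizer; a real pipeline would use a proper C++ lexer.
    for ch in "(){};,":
        code = code.replace(ch, f" {ch} ")
    return code.split()


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def containment(a: set[str], b: set[str]) -> float:
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


def similarity_features(code1: str, code2: str) -> list[float]:
    t1, t2 = tokenize(code1), tokenize(code2)
    s1, s2 = set(t1), set(t2)
    seq_ratio = SequenceMatcher(None, t1, t2).ratio()  # order-sensitive score
    return [jaccard(s1, s2), containment(s1, s2), seq_ratio]


# Toy training data: (solution A, solution B, label); 1 = plagiarized.
pairs = [
    ("int main(){int a=1; return a;}", "int main(){int b=1; return b;}", 1),
    ("int main(){return 0;}", "int main(){int x; x = 5; return x % 2;}", 0),
]
X = [similarity_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.feature_importances_)  # ranks the three similarity methods
```

With real labeled pairs, `clf.predict` flags suspect submissions, and the importance vector yields the kind of per-method ranking the abstract mentions.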
Related papers
- The Failure of Plagiarism Detection in Competitive Programming [0.0]
Plagiarism in programming courses remains a persistent challenge.
This paper examines why traditional code plagiarism detection methods frequently fail in competitive programming contexts.
We find that widely-used automated similarity checkers can be thwarted by simple code transformations or novel AI-generated code.
arXiv Detail & Related papers (2025-05-13T05:43:49Z)
- SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers [16.80818230868491]
This study evaluates large language models (LLMs) in generating code from algorithm descriptions in recent NLP papers.
To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024.
Building on SciReplicate-Bench, we propose Sci-Reproducer, a multi-agent framework consisting of a Paper Agent that interprets algorithmic concepts from the literature and a Code Agent that retrieves dependencies from repositories and implements solutions.
arXiv Detail & Related papers (2025-03-31T22:02:24Z)
- BERT-Enhanced Retrieval Tool for Homework Plagiarism Detection System [0.0]
We propose a plagiarized-text data generation method based on GPT-3.5, which produces a plagiarism detection dataset of 32,927 text pairs.
We also propose a plagiarism identification method based on Faiss with BERT that achieves high efficiency and high accuracy.
Our experiments show that this model outperforms other models on several metrics, achieving 98.86% Accuracy, 98.90% Precision, 98.86% Recall, and an F1 Score of 0.9888.
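A minimal sketch of the retrieval step this entry describes, under stated assumptions: the encoder (`all-MiniLM-L6-v2` via sentence-transformers) and the flat inner-product index are illustrative stand-ins, not the paper's exact configuration.

```python
# Hedged sketch: embed texts with a BERT-style encoder, index them with
# Faiss, and retrieve nearest neighbours as plagiarism candidates.
import faiss  # pip install faiss-cpu
from sentence_transformers import SentenceTransformer  # assumed encoder

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
corpus = ["reference homework text one ...", "reference homework text two ..."]
emb = model.encode(corpus, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(emb.shape[1])  # inner product = cosine on unit vectors
index.add(emb)

query = model.encode(["suspicious submission text ..."],
                     normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)  # top-2 most similar corpus items
print(scores, ids)
```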
arXiv Detail & Related papers (2024-04-01T12:20:34Z)
- A Fixed-Point Approach to Unified Prompt-Based Counting [51.20608895374113]
This paper aims to establish a comprehensive prompt-based counting framework capable of generating density maps for objects indicated by various prompt types, such as box, point, and text.
Our model excels in prominent class-agnostic datasets and exhibits superior performance in cross-dataset adaptation tasks.
arXiv Detail & Related papers (2024-03-15T12:05:44Z)
- Relation-aware Ensemble Learning for Knowledge Graph Embedding [68.94900786314666]
We propose to learn an ensemble by leveraging existing methods in a relation-aware manner.
Exploring these semantics with a relation-aware ensemble, however, leads to a much larger search space than general ensemble methods.
We propose a divide-search-combine algorithm RelEns-DSC that searches the relation-wise ensemble weights independently.
arXiv Detail & Related papers (2023-10-13T07:40:12Z)
- SoK: Privacy-Preserving Data Synthesis [72.92263073534899]
This paper focuses on privacy-preserving data synthesis (PPDS) by providing a comprehensive overview, analysis, and discussion of the field.
We put forth a master recipe that unifies two prominent strands of research in PPDS: statistical methods and deep learning (DL)-based methods.
arXiv Detail & Related papers (2023-07-05T08:29:31Z)
- Beyond Black Box AI-Generated Plagiarism Detection: From Sentence to Document Level [4.250876580245865]
Existing AI-generated text classifiers have limited accuracy and often produce false positives.
We propose a novel approach using natural language processing (NLP) techniques.
We generate multiple paraphrased versions of a given question and input them into a large language model to generate answers.
By using a contrastive loss function based on cosine similarity, we match generated sentences with those from the student's response.
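The matching step can be sketched as follows; this shows only the cosine-similarity comparison, not the contrastive training itself, and the encoder and threshold are illustrative assumptions.

```python
# Hedged sketch: compare LLM-generated answer sentences against a
# student's sentences by cosine similarity; close matches are suspect.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder
generated = ["Mitochondria produce ATP for the cell."]
student = ["Mitochondria generate ATP, powering the cell.",
           "Photosynthesis takes place in chloroplasts."]

gen_emb = model.encode(generated, convert_to_tensor=True)
stu_emb = model.encode(student, convert_to_tensor=True)
sims = util.cos_sim(gen_emb, stu_emb)  # pairwise cosine-similarity matrix

THRESHOLD = 0.8  # illustrative cut-off, not a tuned value from the paper
for i, row in enumerate(sims):
    for j, s in enumerate(row):
        if float(s) >= THRESHOLD:
            print(f"generated[{i}] ~ student[{j}] (cos={float(s):.2f})")
```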
arXiv Detail & Related papers (2023-06-13T20:34:55Z)
- A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
- Revisiting text decomposition methods for NLI-based factuality scoring of summaries [9.044665059626958]
We show that fine-grained decomposition is not always a winning strategy for factuality scoring.
We also show that small changes to previously proposed entailment-based scoring methods can result in better performance.
arXiv Detail & Related papers (2022-11-30T09:54:37Z)
- A Framework and Benchmarking Study for Counterfactual Generating Methods on Tabular Data [0.0]
Counterfactual explanations are viewed as an effective way to explain machine learning predictions.
There are already dozens of algorithms aiming to generate such explanations.
This benchmarking study and framework can help practitioners determine which technique and building blocks best suit their context.
arXiv Detail & Related papers (2021-07-09T21:06:03Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
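For reference, the quantity being estimated has a standard textbook definition, not specific to this paper's algorithms: for A with orthonormal column-space basis U (e.g., from a thin SVD),

```latex
% i-th leverage score of A \in \mathbb{R}^{n \times d}, arbitrary rank:
\ell_i \;=\; a_i^{\top} \left(A^{\top} A\right)^{\dagger} a_i
       \;=\; \lVert U_{i,:} \rVert_2^2, \qquad i = 1, \dots, n
% a_i^T is the i-th row of A; U spans the column space of A.
```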
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
- DRIVE: One-bit Distributed Mean Estimation [16.41391088542669]
We consider the problem where $n$ clients transmit $d$-dimensional real-valued vectors using only $d(1+o(1))$ bits each.
We derive corresponding new algorithms that outperform previous compression algorithms in accuracy and computational efficiency.
arXiv Detail & Related papers (2021-05-18T08:03:39Z)
- Fewer is More: A Deep Graph Metric Learning Perspective Using Fewer Proxies [65.92826041406802]
We propose a Proxy-based deep Graph Metric Learning approach from the perspective of graph classification.
Multiple global proxies are leveraged to collectively approximate the original data points for each class.
We design a novel reverse label propagation algorithm, by which the neighbor relationships are adjusted according to ground-truth labels.
arXiv Detail & Related papers (2020-10-26T14:52:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.